1. Preface 

2. Author Acknowledgements
3. Student Welcome Letter 
4. Sampling and Data 


1. Sampling and Data
2. Sampling and Data: Statistics (edited: Teegarden)
3. Probability
4. Key Terms
5. Data
6. Sampling
7. Sampling and Data: More on Sampling (edited: Teegarden)
8. Variation
9. Answers and Rounding Off
10. Frequency
11. Sampling and Data: Frequency, Relative Frequency, and
12. Summary
13. Practice: Sampling and Data
14. Sampling and Data: Homework (edited: Teegarden)
15. Candy Lab


5. Descriptive Statistics 


1. Descriptive Statistics
. Displaying Data


. Histograms 

. Box Plots 

. Measures of the Location of the Data 

. Measures of the Center of the Data 

. Skewness and the Mean, Median, and Mode 

. Descriptive Statistics: Measuring the Spread of the Data 


10. Chebyshev’s Theorem
11. Summary of Formulas 
12. Practice 1: Center of the Data 
13. Practice 2: Spread of the Data 
14. Homework 
15. Descriptive Statistics: Descriptive Statistics Lab (edited: 
Teegarden) 
6. Probability Topics 
1. Probability Topics 
2. Terminology 
3. Independent and Mutually Exclusive Events 
4. Two Basic Rules of Probability 
5. Contingency Tables 
6.
7.
8. Summary of Formulas
9. Practice 1: Contingency Tables 
10. Practice 2: Calculating Probabilities 
11. Homework 
12. Review 
13. Probability Topics: Probability Lab (edited: Teegarden) 
7. Discrete Random Variables 
1. Discrete Random Variables 
2. Probability Distribution Function (PDF) for a Discrete 
Random Variable 
. Mean or Expected Value and Standard Deviation 
. Common Discrete Probability Distribution Functions 
. Binomial 
. Geometric (optional) 
. Poisson 
. Summary of Functions 




10. Practice 1: Discrete Distribution 
11. Practice 2: Binomial Distribution 
12. Practice 3: Poisson Distribution 
13. Practice 4: Geometric Distribution 
14. Practice 5: Hypergeometric Distribution 
15. Homework 
16. Review 
17. Discrete Random Variables: Lab I (edited: Teegarden) 
8. Continuous Random Variables 
1. Continuous Random Variables 
2. Continuous Probability Functions 
3. The Uniform Distribution 
4. The Exponential Distribution 
5. Summary of the Uniform and Exponential Probability 
Distributions 
6. Practice 1: Uniform Distribution 
7. Practice 2: Exponential Distribution 
8. Homework 
9. Review 
10. Continuous Random Variables: Lab I 
9. The Normal Distribution 
1. The Normal Distribution 
2. The Standard Normal Distribution
3. Z-Scores
4. Normal Distribution: Areas to the Left and Right of x
5. Calculations of Probabilities
6. Summary of Formulas
7. Practice: The Normal Distribution
8. Homework
9. Review
10. Normal Distribution: Normal Distribution Lab I (edited:
Teegarden)


10. The Central Limit Theorem 
1. The Central Limit Theorem 
2. The Central Limit Theorem for Sample Means (Averages) 
3. The Central Limit Theorem for Sums 
4. Using the Central Limit Theorem 
5. Summary of Formulas 
6. Practice: The Central Limit Theorem 
7. Homework 
8. Review 
9. Dice lab 
11. Confidence Intervals 
1. Confidence Intervals 


Deviation Unknown, Student-T 
4. Confidence Interval for a Population Proportion 
5. Sample Size 
6. Summary of Formulas 
7. Practice 1: Confidence Intervals for Averages, Known 
Population Standard Deviation 
8. Practice 2: Confidence Intervals for Averages, Unknown 
Population Standard Deviation 
9. Practice 3: Confidence Intervals for Proportions 
10. Homework 
11. Review 
12. Confidence Interval Lab 4 
12. Hypothesis Testing: Single Mean and Single Proportion 
1. Hypothesis Testing: Single Mean and Single Proportion 
2. Null and Alternate Hypotheses 
3. Outcomes and the Type I and Type II Errors 


4. Distribution Needed for Hypothesis Testing 


. Assumption 
. Rare Events 


. Decision and Conclusion 

. Additional Information 

. Summary of the Hypothesis Test 

. Examples 

. Summary of Formulas 

. Practice 1: Single Mean, Known Population Standard Deviation
. Practice 2: Single Mean, Unknown Population Standard Deviation
. Practice 3: Single Proportion
. Homework
. Review
. Hypothesis Testing of Single Mean and Single Proportion: Lab (Edited: Teegarden)


13. Hypothesis Testing: Two Means, Paired Data, Two Proportions 


1. 




. Practice 2: Hypothesis Testing for Two Averages 
. Homework 
. Review 


Lab I (edited: Teegarden) 
14. The Chi-Square Distribution 
1. The Chi-Square Distribution
2. Notation
3. Facts About the Chi-Square Distribution
4. Goodness-of-Fit Test
5. Test of Independence
6. Test of a Single Variance (Optional)
7. Summary of Formulas
8. Practice 1: Goodness-of-Fit Test
9. Practice 2: Contingency Tables
10. Practice 3: Test of a Single Variance
11. Homework
12. Review 
13. Chi Squared Distribution Lab 

15. Linear Regression and Correlation 

1. Linear Regression and Correlation 

2. Linear Regression and Correlation: Linear Equations 

3. Linear Regression and Correlation: Slope and Y-Intercept of a Linear Equation
4. Scatter Plots
5. The Regression Equation
6. The Correlation Coefficient
7. Facts About the Correlation Coefficient for Linear Regression
8. Prediction
9. Outliers
10. 95% Critical Values of the Sample Correlation Coefficient Table
11. Linear Regression and Correlation: Summary
12. Practice: Linear Regression
13. Homework
14. Linear Regression and Correlation: Regression Lab II (edited: Teegarden)


16. Appendix 


1. Practice Final Exam 1
2. Practice Final Exam 2
3. Data Sets
4. Group Projects
1. Collaborative Statistics: Group Project - Teegarden
5. Solution Sheets
1. Solution Sheet: Hypothesis Testing for Single Mean and Single Proportion
Paired Data, and Two Proportions
3. Solution Sheet: The Chi-Square Distribution
6. English Phrases Written Mathematically
7. Symbols and their Meanings
8. Formulas


Preface 
This module introduces the Connexions online textbook "Collaborative 
Statistics" by Barbara Illowsky and Susan Dean. 


Welcome to Collaborative Statistics, presented by Connexions. The initial 
section below introduces you to Connexions. If you are familiar with 
Connexions, please skip to About "Collaborative Statistics." 


About Connexions 


Connexions Modular Content 

Connexions (cnx.org) is an online, open access educational resource 
dedicated to providing high quality learning materials free online, free in 
printable PDF format, and at low cost in bound volumes through print-on-demand
publishing. The Collaborative Statistics textbook is one of many
collections available to Connexions users. Each collection is composed of a 
number of re-usable learning modules written in the Connexions XML 
markup language. Each module may also be re-used (or 're-purposed') as 
part of other collections and may be used outside of Connexions. Including 
Collaborative Statistics, Connexions currently offers over 6500 modules 
and more than 350 collections. 


The modules of Collaborative Statistics are derived from the original paper 
version of the textbook under the same title, Collaborative Statistics. Each 
module represents a self-contained concept from the original work. 
Together, the modules comprise the original textbook. 


Re-use and Customization 

The Creative Commons (CC) Attribution license applies to all Connexions 
modules. Under this license, any module in Connexions may be used or 
modified for any purpose as long as proper attribution to the original 
author(s) is maintained. Connexions' authoring tools make re-use (or
re-purposing) easy. Therefore, instructors anywhere are permitted to create
customized versions of the Collaborative Statistics textbook by editing
modules, deleting unneeded modules, and adding their own supplementary
modules. Connexions’ authoring tools keep track of these changes and
maintain the CC license's required attribution to the original authors. This
process creates a new collection that can be viewed online, downloaded as a
single PDF file, or ordered in any quantity by instructors and students as a
low-cost printed textbook. To start building custom collections, please visit
the help page, “Create a Collection with Existing Modules”. For a guide to
authoring modules, please look at the help page, “Create a Module in
Minutes”.


Read the book online, print the PDF, or buy a copy of the book. 

To browse the Collaborative Statistics textbook online, visit the collection 
home page at cnx.org/content/col10522/latest. You will then have three 
options. 


1. You may obtain a PDF of the entire textbook to print or view offline 
by clicking on the “Download PDF” link in the “Content Actions” 
box. 

2. You may order a bound copy of the collection by clicking on the 
“Order Printed Copy” button. 

3. You may view the collection modules online by clicking on the “Start 
>>” link, which takes you to the first module in the collection. You can 
then navigate through the subsequent modules by using their “Next 
>>” and “Previous >>” links to move forward and backward in the 
collection. You can jump to any module in the collection by clicking 
on that module’s title in the “Collection Contents” box on the left side 
of the window. If these contents are hidden, make them visible by 
clicking on “[show table of contents]”. 


Accessibility and Section 508 Compliance 
• For information on general Connexions accessibility features, please
visit http://cnx.org/content/m17212/latest/.
• For information on accessibility features specific to the Collaborative
Statistics textbook, please visit http://cnx.org/content/m17211/latest/.


Version Change History and Errata 


• For a list of modifications, updates, and corrections, please visit
http://cnx.org/content/m17360/latest/.


Adoption and Usage 


• The Collaborative Statistics collection has been adopted and
customized by a number of professors and educators for use in their
classes. For a list of known versions and adopters, please visit
http://cnx.org/content/m18261/latest/.


About “Collaborative Statistics” 


Collaborative Statistics was written by Barbara Illowsky and Susan Dean, 
faculty members at De Anza College in Cupertino, California. The textbook 
was developed over several years and has been used in regular and honors-level
classroom settings and in distance learning classes. Courses using this
textbook have been articulated by the University of California for transfer 
of credit. The textbook contains full materials for course offerings, 
including expository text, examples, labs, homework, and projects. A 
Teacher’s Guide is currently available in print form and on the Connexions 
site at http://cnx.org/content/col10547/latest/, and supplemental course 
materials including additional problem sets and video lectures are available 
at http://cnx.org/content/col10586/latest/. The on-line text for each of these 
collections will meet the Section 508 standards for accessibility.


An on-line course based on the textbook was also developed by Illowsky 
and Dean. It has won an award as the best on-line California community 
college course. The on-line course will be available at a later date as a 
collection in Connexions, and each lesson in the on-line course will be 
linked to the on-line textbook chapter. The on-line course will include, in 
addition to expository text and examples, videos of course lectures in 
captioned and non-captioned format. 


The original preface to the book, as written by Professors Illowsky and
Dean, now follows:


This book is intended for introductory statistics courses being taken by
students at two- and four-year colleges who are majoring in fields other
than math or engineering. Intermediate algebra is the only prerequisite. The
book focuses on applications of statistical knowledge rather than the theory
behind it. The text is named Collaborative Statistics because students learn
best by doing. In fact, they learn best by working in small groups. The old
saying “two heads are better than one” truly applies here.


Our emphasis in this text is on four main concepts: 


• thinking statistically
• incorporating technology
• working collaboratively
• writing thoughtfully


These concepts are integral to our course. Students learn best by
actively participating, not by just watching and listening. Teaching should
be highly interactive. Students need to be thoroughly engaged in the 
learning process in order to make sense of statistical concepts. 
Collaborative Statistics provides techniques for students to write across the 
curriculum, to collaborate with their peers, to think statistically, and to 
incorporate technology. 


This book takes students step by step. The text is interactive. Therefore, 
students can immediately apply what they read. Once students have 
completed the process of problem solving, they can tackle interesting and 
challenging problems relevant to today’s world. The problems require the 
students to apply their newly found skills. In addition, technology (TI-83 
graphing calculators are highlighted) is incorporated throughout the text and 
the problems, as well as in the special group activities and projects. The 
book also contains labs that use real data and practices that lead students 
step by step through the problem solving process. 


At De Anza, along with hundreds of other colleges across the country, the 
college audience involves a large number of ESL students as well as 
students from many disciplines. The ESL students, as well as the non-ESL 
students, have been especially appreciative of this text. They find it 
extremely readable and understandable. Collaborative Statistics has been 
used in classes that range from 20 to 120 students, and in regular, honors,
and distance learning classes.


Susan Dean 


Barbara Illowsky 


Author Acknowledgements
This module contains the author acknowledgements for the Collaborative 
Statistics textbook/collection. 


For this second edition, we appreciate the tremendous feedback from De 
Anza College colleagues and students, as well as from the dozens of faculty 
around the world who taught out of the first and preliminary editions. We 
have updated Collaborative Statistics with contributions from many faculty 
and students. We especially thank Roberta Bloom, who wrote new problems 
and additional text. 


So many students and colleagues have contributed to the text, both the hard 
copy and open version. We thank the following people for their 
contributions to the first and/or second editions. 


At De Anza College: 

Dr. Inna Grushko (deceased), who wrote the glossary; Diane Mathios, who 

Additional thanks:

Dr. Larry Green of Lake Tahoe Community College, Terrie Teegarden of 
San Diego Mesa College, Ann Flanigan of Kapiolani Community College, 
Birgit Aquilonius of West Valley College. 


The conversion from a for-profit hard copy text to a free open textbook is 
the result of many individuals and organizations. We particularly thank Dr. 
Martha Kanter, Hal Plotkin, Dr. Judy Baker, Dr. Robert Maxfield of 
Maxfield Foundation, Hewlett Foundation, and Connexions. 


Finally, we owe much to Frank, Jeffrey, and Jessica Dean and to Dan,
Rachel, Matthew, and Rebecca Illowsky, who encouraged us to continue
with our work and who had to hear more than their share of “I’m sorry, I
can’t” and “Just a minute, I’m working.”


Student Welcome Letter 
Dear Student: 


Have you heard others say, “You’re taking statistics? That’s the hardest 
course I ever took!” They say that because they probably spent the entire
course confused and struggling. They were probably lectured to and never 
had the chance to experience the subject. You will not have that problem. 
Let’s find out why. 


There is a Chinese proverb that describes our feelings about the field of
statistics:


I HEAR, AND I FORGET 
I SEE, AND I REMEMBER 
I DO, AND I UNDERSTAND 


Statistics is a “do” field. In order to learn it, you must “do” it. We have 
structured this book so that you will have hands-on experiences. They will 
enable you to truly understand the concepts instead of merely going through 
the requirements for the course. 


What makes this book different from other texts? First, we have eliminated 
the drudgery of tedious calculations. You might be using computers or 
graphing calculators so that you do not need to struggle with algebraic 
manipulations. Second, this course is taught as a collaborative activity. With 
others in your class, you will work toward the common goal of learning this 
material. 


Here are some hints for success in your class: 


• Work hard and work every night.
• Form a study group and learn together.
• Don’t get discouraged - you can do it!
• As you solve problems, ask yourself, “Does this answer make sense?”
• Many statistics words have the same meaning as in everyday English.
• Go to your teacher for help as soon as you need it.
• Don’t get behind.
• Read the newspaper and ask yourself, “Does this article make sense?”
• Draw pictures - they truly help!


Good luck and don’t give up! 


Sincerely, 
Susan Dean and Barbara Illowsky 


De Anza College 
21250 Stevens Creek Blvd. 
Cupertino, California 95014 


Sampling and Data 
This module provides a brief introduction to the field of statistics, including 
examples of how these topics show up in a variety of real-life settings.


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Recognize and differentiate between key terms.
• Apply various types of sampling methods to data collection.
• Create and interpret frequency tables.


Introduction 


You are probably asking yourself the question, "When and where will I use
statistics?" If you read any newspaper, watch television, or use the
Internet, you will see statistical information. There are statistics about
crime, sports, education, politics, and real estate. Typically, when you read a 
newspaper article or watch a news program on television, you are given 
sample information. With this information, you may make a decision about 
the correctness of a statement, claim, or "fact." Statistical methods can help 
you make the "best educated guess." 


Since you will undoubtedly be given statistical information at some point in 
your life, you need to know some techniques to analyze the information 
thoughtfully. Think about buying a house or managing a budget. Think 
about your chosen profession. The fields of economics, business, 
psychology, education, biology, law, computer science, police science, and 
early childhood development require at least one course in statistics. 


Included in this chapter are the basic ideas and words of probability and
statistics. You will soon understand that statistics and probability work
together. You will also learn how data are gathered and what "good" data 
are. 


Sampling and Data: Statistics (edited: Teegarden) 


The science of statistics deals with the collection, analysis, interpretation, 
and presentation of data. We see and use data in our everyday lives. To be 
able to use data correctly is essential to many professions and is in your 
own best self-interest. 


Example 


Suppose class members write down the average time (in hours, to the 
nearest half-hour) they sleep per night. Then create a simple graph (called a 
dot plot) of the data. A dot plot consists of a number line and dots (or 
points) positioned above the number line. For example, consider the 
following data: 


5; 5.5; 6; 6; 6; 6.5; 6.5; 6.5; 6.5; 7; 7; 8; 8; 9


The dot plot for this data would be as follows: 
Frequency of Average Time (in Hours) Spent Sleeping per Night

               o
          o    o
          o    o    o         o
o    o    o    o    o         o         o
5    5.5  6    6.5  7    7.5  8    8.5  9
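
As a rough illustration of how a dot plot is built, the following Python sketch tallies how often each value occurs and prints one "o" per observation above a number line. The list `sleep_hours` holds the values as read off the dot plot above, so treat the exact numbers as an assumption of this sketch. The helper name `dot_plot` and its parameters are hypothetical, not part of the textbook.

```python
from collections import Counter

# Average sleep times in hours, as read off the dot plot in the text.
# The exact values are an assumption of this sketch.
sleep_hours = [5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9]


def dot_plot(values, step=0.5, width=5):
    """Return a text dot plot: one column per tick, one 'o' per observation."""
    counts = Counter(values)  # frequency of each distinct value
    low, high = min(values), max(values)
    ticks = [low + i * step for i in range(int(round((high - low) / step)) + 1)]
    rows = []
    # Build the plot from the tallest stack of dots down to the axis.
    for level in range(max(counts.values()), 0, -1):
        cells = ["o" if counts[t] >= level else " " for t in ticks]
        rows.append("".join(c.ljust(width) for c in cells).rstrip())
    # The last row is the number line itself.
    rows.append("".join(f"{t:g}".ljust(width) for t in ticks).rstrip())
    return "\n".join(rows)


print(dot_plot(sleep_hours))
```

The tallest column shows the most frequent value at a glance, which is the kind of summary that descriptive statistics formalizes.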


In this course, you will learn how to organize and summarize data. 
Organizing and summarizing data is called descriptive statistics. Two ways 
to summarize data are by graphing and by numbers (for example, finding an 
average). After you have studied probability and probability distributions, 
you will use formal methods for drawing conclusions from "good" data. 
The formal methods are called inferential statistics. Statistical inference 
uses probability to determine if conclusions drawn are reliable or not. 


Effective interpretation of data (inference) is based on good procedures for 
producing data and thoughtful examination of the data. You will encounter 
what will seem to be too many mathematical formulas for interpreting data. 
The goal of statistics is not to perform numerous calculations using the 
formulas, but to gain an understanding of your data. The calculations can be 
done using a calculator or a computer. The understanding must come from 
you. If you can thoroughly grasp the basics of statistics, you can be more 
confident in the decisions you make in life. 


Glossary 


Data 

A set of observations (a set of possible outcomes). Most data can be
put into two groups: qualitative (hair color, ethnic groups, and many
other attributes of the population) and quantitative (distance traveled to
college, number of children in a family, etc.). In turn, quantitative
data can be separated into two subgroups: discrete and continuous.
Roughly speaking, data is discrete if it is the result of counting (the number
of students of a given ethnic group in a class, the number of books on a
shelf, etc.), and data is continuous if it is the result of measuring (distance
traveled, weight of luggage, etc.).


Statistic 
A numerical characteristic of the sample. A statistic estimates the
corresponding population parameter. For example, the average number
of full-time students in a 7:30 a.m. class for this term (statistic) is an 
estimate for the average number of full-time students in any class this 
term (parameter). 


Probability 
This module introduces the concept of probability as a mathematical 
measure of randomness, including a number of real-world applications. 


Probability is a mathematical tool used to study randomness. It deals with
the chance (the likelihood) of an event occurring. For example, if you toss a
fair coin 4 times, the outcomes may not be 2 heads and 2 tails. However, if
you toss the same coin 4,000 times, the outcomes will be close to half heads
and half tails. The expected theoretical probability of heads in any one toss
is $\frac{1}{2}$ or 0.5. Even though the outcomes of a few repetitions are uncertain,
there is a regular pattern of outcomes when there are many repetitions.
After reading about the English statistician Karl Pearson who tossed a coin
24,000 times with a result of 12,012 heads, one of the authors tossed a coin
2,000 times. The results were 996 heads. The fraction $\frac{996}{2000}$ is equal to 0.498,
which is very close to 0.5, the expected probability.
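
The pattern described above can be explored with a short simulation. This is only a sketch: it uses Python's pseudo-random number generator in place of a physical coin, and the seed value and the helper name `heads_fraction` are arbitrary choices made for illustration.

```python
import random


def heads_fraction(n_tosses, rng):
    """Toss a simulated fair coin n_tosses times; return the fraction of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


rng = random.Random(2024)  # arbitrary fixed seed so the run is repeatable
for n in (4, 40, 400, 4000):
    # Small n can land far from 0.5; large n tends to settle close to it.
    print(n, heads_fraction(n, rng))

# The author's experiment quoted in the text: 996 heads in 2,000 tosses.
print(996 / 2000)  # 0.498
```

With only 4 tosses the fraction of heads can be anything from 0 to 1, while with thousands of tosses it tends to settle near 0.5.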


The theory of probability began with the study of games of chance such as 
poker. Predictions take the form of probabilities. To predict the likelihood 
of an earthquake, of rain, or whether you will get an A in this course, we 
use probabilities. Doctors use probability to determine the chance of a 
vaccination causing the disease the vaccination is supposed to prevent. A 
stockbroker uses probability to determine the rate of return on a client's 
investments. You might use probability to decide to buy a lottery ticket or 
not. In your study of statistics, you will use the power of mathematics 
through probability calculations to analyze and interpret your data. 


Glossary 


Probability 
A number between 0 and 1, inclusive, that gives the likelihood that a
specific event will occur. The foundation of statistics is given by the
following 3 axioms (by A. N. Kolmogorov, 1930s): Let S denote the
sample space and let A and B be two events in S. Then:

• $0 \le P(A) \le 1$.
• If A and B are any two mutually exclusive events, then
P(A or B) = P(A) + P(B).
• P(S) = 1.
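
To make the axioms concrete, here is a small check on a hypothetical sample space: one roll of a fair six-sided die. The events A and B and the helper `prob` are illustrative assumptions, not part of the glossary entry.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
S = {1, 2, 3, 4, 5, 6}


def prob(event):
    """P(event) when every outcome in S is equally likely."""
    return Fraction(len(event & S), len(S))


A = {1, 2}  # roll a 1 or a 2
B = {5}     # roll a 5; shares no outcome with A

assert 0 <= prob(A) <= 1                 # probabilities lie between 0 and 1
assert A & B == set()                    # A and B are mutually exclusive
assert prob(A | B) == prob(A) + prob(B)  # 3/6 = 2/6 + 1/6
assert prob(S) == 1                      # the whole sample space has probability 1

print(prob(A), prob(B), prob(A | B))     # 1/3 1/6 1/2
```

Exact fractions are used so that the addition rule can be checked with equality instead of a floating-point tolerance.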


Key Terms 
This module introduces a number of key terms related to statistical 
sampling and data. 


In statistics, we generally want to study a population. You can think of a 
population as an entire collection of persons, things, or objects under study. 
To study the larger population, we select a sample. The idea of sampling is 
to select a portion (or subset) of the larger population and study that portion 
(the sample) to gain information about the population. Data are the result of 
sampling from a population. 


Because it takes a lot of time and money to examine an entire population, 
sampling is a very practical technique. If you wished to compute the overall 
grade point average at your school, it would make sense to select a sample 
of students who attend the school. The data collected from the sample 
would be the students' grade point averages. In presidential elections, 
opinion poll samples of 1,000 to 2,000 people are taken. The opinion poll is 
supposed to represent the views of the people in the entire country. 
Manufacturers of canned carbonated drinks take samples to determine if a
16-ounce can contains 16 ounces of carbonated drink.


From the sample data, we can calculate a statistic. A statistic is a number 
that is a property of the sample. For example, if we consider one math class 
to be a sample of the population of all math classes, then the average 
number of points earned by students in that one math class at the end of the 
term is an example of a statistic. The statistic is an estimate of a population 
parameter. A parameter is a number that is a property of the population. 
Since we considered all math classes to be the population, then the average 
number of points earned per student over all the math classes is an example 
of a parameter. 
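
The distinction between a statistic and a parameter can be sketched in code. Everything here is hypothetical: the population is 500 made-up point totals and the sample is 30 of them chosen at random, so the numbers illustrate the idea and are not real class data.

```python
import random
from statistics import mean

rng = random.Random(7)  # arbitrary fixed seed for a repeatable run

# Hypothetical population: end-of-term points for 500 math students.
population = [rng.randint(40, 100) for _ in range(500)]

# 30 students drawn at random play the role of the sample.
sample = rng.sample(population, 30)

parameter = mean(population)  # a property of the whole population
statistic = mean(sample)      # a property of the sample; estimates the parameter

print(f"parameter = {parameter:.1f}, statistic = {statistic:.1f}")
```

In practice the parameter is usually unknown, which is why the statistic computed from a sample is used to estimate it.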


One of the main concerns in the field of statistics is how accurately a
statistic estimates a parameter. The accuracy really depends on how well the
sample represents the population. The sample must contain the 
characteristics of the population in order to be a representative sample. We 
are interested in both the sample statistic and the population parameter in 
inferential statistics. In a later chapter, we will use the sample statistic to 
test the validity of the established population parameter. 


A variable, notated by capital letters like X and Y, is a characteristic of 
interest for each person or thing in a population. Variables may be 
numerical or categorical. Numerical variables take on values with equal 
units such as weight in pounds and time in hours. Categorical variables 
place the person or thing into a category. If we let X equal the number of 
points earned by one math student at the end of a term, then X is a
numerical variable. If we let Y be a person's party affiliation, then examples 
of Y include Republican, Democrat, and Independent. Y is a categorical 
variable. We could do some math with values of X (calculate the average 
number of points earned, for example), but it makes no sense to do math 
with values of Y (calculating an average party affiliation makes no sense). 


Data are the actual values of the variable. They may be numbers or they 
may be words. Datum is a single value. 


Two words that come up often in statistics are mean and proportion. If you
were to take three exams in your math class and obtained scores of 86,
75, and 92, you would calculate your mean score by adding the three exam scores
and dividing by three (your mean score would be 84.3 to one decimal
place). If, in your math class, there are 40 students and 22 are men and 18
are women, then the proportion of men students is $\frac{22}{40}$ and the proportion of
women students is $\frac{18}{40}$. Mean and proportion are discussed in more detail in
later chapters.
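
The arithmetic in this paragraph can be checked with a few lines of Python. The variable names are illustrative choices, and the numbers are the ones given in the text.

```python
from fractions import Fraction

# Mean of the three exam scores from the text.
scores = [86, 75, 92]
mean_score = sum(scores) / len(scores)
print(round(mean_score, 1))  # 84.3

# Proportions of men and women in a class of 40 students.
men, women = 22, 18
total = men + women
print(Fraction(men, total), men / total)      # 11/20 0.55
print(Fraction(women, total), women / total)  # 9/20 0.45
```

The two proportions add up to 1 because every student in the class falls into exactly one of the two groups.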


Note: 

Mean and Average 

The words "mean" and "average" are often used interchangeably. The 
substitution of one word for the other is common practice. The technical 
term is "arithmetic mean" and "average" is technically a center location. 
However, in practice among non-statisticians, "average" is commonly 
accepted for "arithmetic mean." 


Example: 


Exercise: 


Problem: 


Define the key terms from the following study: We want to know the 
average (mean) amount of money first year college students spend at 
ABC College on school supplies that do not include books. We 
randomly survey 100 first year students at the college. Three of those 
students spent $150, $200, and $225, respectively. 


Solution: 


The population is all first year students attending ABC College this 
term. 


The sample could be all students enrolled in one section of a 
beginning statistics course at ABC College (although this sample may 
not represent the entire population). 


The parameter is the average (mean) amount of money spent 
(excluding books) by first year college students at ABC College this 
term. 


The statistic is the average (mean) amount of money spent (excluding 
books) by first year college students in the sample. 


The variable could be the amount of money spent (excluding books) 
by one first year student. Let X = the amount of money spent 
(excluding books) by one first year student attending ABC College. 


The data are the dollar amounts spent by the first year students. 
Examples of the data are $150, $200, and $225. 


Optional Collaborative Classroom Exercise 


Do the following exercise collaboratively with up to four people per group. 
Find a population, a sample, the parameter, the statistic, a variable, and data 
for the following study: You want to determine the average (mean) number 
of glasses of milk college students drink per day. Suppose yesterday, in your 
English class, you asked five students how many glasses of milk they drank 
the day before. The answers were 1, 0, 1, 3, and 4 glasses of milk. 


Glossary 


Average 
A number that describes the central tendency of the data. There are a 
number of specialized averages, including the arithmetic mean, 
weighted mean, median, mode, and geometric mean. 


Data 
A set of observations (a set of possible outcomes). Most data can be 
put into two groups: qualitative (hair color, ethnic groups and other 
attributes of the population) and quantitative (distance traveled to 
college, number of children in a family, etc.). Quantitative data can be 
separated into two subgroups: discrete and continuous. Data is 
discrete if it is the result of counting (the number of students of a given 
ethnic group in a class, the number of books on a shelf, etc.). Data is 
continuous if it is the result of measuring (distance traveled, weight of 
luggage, etc.) 


Proportion 


• As a number: A proportion is the number of successes divided by
the total number in the sample.

• As a probability distribution: Given a binomial random variable
(RV), $X \sim B(n, p)$, consider the ratio of the number X of
successes in n Bernoulli trials to the number n of trials: $P' = \frac{X}{n}$.
This new RV is called a proportion, and if the number of trials, n,
is large enough, $P'$ is approximately normally distributed with mean p
and standard deviation $\sqrt{p(1-p)/n}$.
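
The behavior of the proportion as a random variable can be explored with a simulation. This is a sketch under stated assumptions: the values n = 100 and p = 0.55, the seed, and the number of repetitions are arbitrary, and the comparison values are the standard mean p and standard deviation sqrt(p(1 - p)/n) of a sample proportion.

```python
import math
import random
from statistics import mean, pstdev

rng = random.Random(11)  # arbitrary fixed seed
n, p = 100, 0.55         # hypothetical number of trials and success probability


def sample_proportion():
    """One draw of the ratio X / n, where X counts successes in n Bernoulli trials."""
    x = sum(rng.random() < p for _ in range(n))
    return x / n


draws = [sample_proportion() for _ in range(2000)]

print("simulated mean:", round(mean(draws), 3), " theory:", p)
print("simulated sd:  ", round(pstdev(draws), 3),
      " theory:", round(math.sqrt(p * (1 - p) / n), 3))
```

Increasing n makes the simulated proportions cluster more tightly around p, which is why larger samples give more precise estimates.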


Data 

This module introduces the concepts of qualitative data, quantitative 
continuous data, and quantitative discrete data as used in statistics. Sample 
problems are included. 


Data may come from a population or from a sample. Small letters like x or 
y generally are used to represent data values. Most data can be put into the 
following categories: 


• Qualitative
• Quantitative


Qualitative data are the result of categorizing or describing attributes of a 
population. Hair color, blood type, ethnic group, the car a person drives, and 
the street a person lives on are examples of qualitative data. Qualitative data 
are generally described by words or letters. For instance, hair color might 
be black, dark brown, light brown, blonde, gray, or red. Blood type might 
be AB+, O-, or B+. Researchers often prefer to use quantitative data over
qualitative data because quantitative data lend themselves more easily to
mathematical analysis. For example, it does not make sense to find an
average hair color or blood type.


Quantitative data are always numbers. Quantitative data are the result of 
counting or measuring attributes of a population. Amount of money, pulse 
rate, weight, number of people living in your town, and the number of 
students who take statistics are examples of quantitative data. Quantitative 
data may be either discrete or continuous. 


All data that are the result of counting are called quantitative discrete 
data. These data take on only certain numerical values. If you count the 
number of phone calls you receive for each day of the week, you might get 
0, 1, 2, 3, etc. 


All data that are the result of measuring are quantitative continuous data,
assuming that we can measure accurately. Measuring angles in radians, for
example, produces continuous data. If students carry backpacks with books
in them, the numbers of books in the backpacks are discrete data and the
weights of the backpacks are continuous data.


Note: In this course, the data used is mainly quantitative. It is easy to
calculate statistics (like the mean or proportion) from numbers. In the 
chapter Descriptive Statistics, you will be introduced to stem plots, 
histograms and box plots all of which display quantitative data. Qualitative 
data is discussed at the end of this section through graphs. 


Example: 

Data Sample of Quantitative Discrete Data 

The data are the number of books students carry in their backpacks. You 
sample five students. Two students carry 3 books, one student carries 4 
books, one student carries 2 books, and one student carries 1 book. The 
numbers of books (3, 4, 2, and 1) are the quantitative discrete data. 


Example: 

Data Sample of Quantitative Continuous Data 

The data are the weights of the backpacks with the books in them. You sample
the same five students. The weights (in pounds) of their backpacks are 6.2, 
7, 6.8, 9.1, 4.3. Notice that backpacks carrying three books can have 
different weights. Weights are quantitative continuous data because 
weights are measured. 


Example: 

Data Sample of Qualitative Data 

The data are the colors of backpacks. Again, you sample the same five 
students. One student has a red backpack, two students have black 
backpacks, one student has a green backpack, and one student has a gray 
backpack. The colors red, black, black, green, and gray are qualitative data. 


Note: You may collect data as numbers and report it categorically. For 
example, the quiz scores for each student are recorded throughout the term. 
At the end of the term, the quiz scores are reported as A, B, C, D, or F. 


Example: 
Exercise: 


Problem: 


Work collaboratively to determine the correct data type (quantitative 
or qualitative). Indicate whether quantitative data are continuous or 
discrete. Hint: Data that are discrete often start with the words "the 
number of." 

1. The number of pairs of shoes you own.
2. The type of car you drive.
3. Where you go on vacation.
4. The distance it is from your home to the nearest grocery store.
5. The number of classes you take per school year.
6. The tuition for your classes.
7. The type of calculator you use.
8. Movie ratings.
9. Political party preferences.
10. Weight of sumo wrestlers.
11. Amount of money won playing poker.
12. Number of correct answers on a quiz.
13. People's attitudes toward the government.
14. IQ scores. (This may cause some discussion.)


Solution: 


Items 1, 5, 11, and 12 are quantitative discrete; items 4, 6, 10, and 14 
are quantitative continuous; and items 2, 3, 7, 8, 9, and 13 are 
qualitative. 


Qualitative Data Discussion 

Below are tables of part-time vs full-time students at De Anza College in 
Cupertino, CA and Foothill College in Los Altos, CA for the Spring 2010 
quarter. The tables display counts (frequencies) and percentages or 
proportions (relative frequencies). The percent columns make comparing 
the same categories in the colleges easier. Displaying percentages along 
with the numbers is often helpful, but it is particularly important when 
comparing sets of data that do not have the same totals, such as the total 
enrollments for both colleges in this example. Notice how much larger the 
percentage for part-time students at Foothill College is compared to De 
Anza College. 


            Number    Percent
Full-time    9,200     40.9%
Part-time   13,296     59.1%
Total       22,496    100%
De Anza College


            Number    Percent
Full-time    4,059     28.6%
Part-time   10,124     71.4%
Total       14,183    100%
Foothill College
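

The percent columns in the tables above can be reproduced from the counts. The following is a minimal Python sketch, not part of the original text; the function name relative_frequencies is chosen here only for illustration.

```python
def relative_frequencies(counts):
    """Return each category's share of the total, as a percent rounded to one decimal place."""
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


# Counts taken from the two enrollment tables (Spring 2010).
de_anza = {"Full-time": 9200, "Part-time": 13296}
foothill = {"Full-time": 4059, "Part-time": 10124}

print(relative_frequencies(de_anza))   # {'Full-time': 40.9, 'Part-time': 59.1}
print(relative_frequencies(foothill))  # {'Full-time': 28.6, 'Part-time': 71.4}
```

Because the two colleges have different total enrollments, comparing the percents is more informative than comparing the raw counts.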


Tables are a good way of organizing and displaying data. But graphs can be 
even more helpful in understanding the data. There are no strict rules 
concerning what graphs to use. Below are pie charts and bar graphs, two 
graphs that are used to display qualitative data. 


In a pie chart, categories of data are represented by wedges in the circle 
and are proportional in size to the percent of individuals in each category. 


In a bar graph, the length of the bar for each category is proportional to the 
number or percent of individuals in each category. Bars may be vertical or 
horizontal. 


A Pareto chart consists of bars that are sorted into order by category size 
(largest to smallest). 


Look at the graphs and determine which graph (pie or bar) you think 
displays the comparisons better. This is a matter of preference. 


It is a good idea to look at a variety of graphs to see which is the most 
helpful in displaying the data. We might make different choices of what we 
think is the "best" graph depending on the data and the context. Our choice 
also depends on what we are using the data for. 


[Figure: pie charts of full-time versus part-time students for De Anza College and Foothill College (Foothill: Full Time 28.6%, Part Time 71.4%), and a bar graph titled "Student Status" comparing De Anza and Foothill.]


Percentages That Add to More (or Less) Than 100% 

Sometimes percentages add up to be more than 100% (or less than 100%). 
In the graph, the percentages add to more than 100% because students can 
be in more than one category. A bar graph is appropriate to compare the 
relative size of the categories. A pie chart cannot be used. It also could not 
be used if the percentages added to less than 100%. 


Characteristic/Category                                               Percent
Full-time Students                                                     40.9%
Students who intend to transfer to a 4-year educational institution   48.6%
Students under age 25                                                  61.0%
TOTAL                                                                 150.5%
De Anza College Spring 2010


Omitting Categories/Missing Data 

The table displays Ethnicity of Students but is missing the 
"Other/Unknown" category. This category contains people who did not feel 
they fit into any of the ethnicity categories or declined to respond. Notice 
that the frequencies do not add up to the total number of students. Create a 
bar graph and not a pie chart. 


                    Frequency               Percent
Asian               8,794                   36.1%
Black               1,412                   5.8%
Filipino            1,298                   5.3%
Hispanic            4,180                   17.1%
Native American     146                     0.6%
Pacific Islander    236                     1.0%
White               5,978                   24.5%
TOTAL               22,044 out of 24,382    90.4% out of 100%
Missing Data: Ethnicity of Students De Anza College Fall Term 2007 (Census Day)


[Figure: bar graph titled "Ethnicity of Students"; vertical axis in percent; categories Asian, Black, Filipino, Hispanic, Native American, Pacific Islander, White.]
Bar Graph Without Other/Unknown Category


The following graph is the same as the previous graph but the 
"Other/Unknown" percent (9.6%) has been added back in. The 
"Other/Unknown" category is large compared to some of the other 
categories (particularly Native American, 0.6%, and Pacific Islander, 1.0%).
This is important to know when we think about what the data are telling us. 


This particular bar graph can be hard to understand visually. The graph 
below it is a Pareto chart. The Pareto chart has the bars sorted from largest 
to smallest and is easier to read and interpret. 
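

The ordering that turns a bar graph into a Pareto chart is just a sort by category size. The following is a minimal Python sketch using the percents from the ethnicity table plus the 9.6% "Other/Unknown" category; the variable names are chosen here for illustration and no plotting library is assumed.

```python
# Percents from the table, with "Other/Unknown" added back in.
percents = {
    "Asian": 36.1,
    "Black": 5.8,
    "Filipino": 5.3,
    "Hispanic": 17.1,
    "Native American": 0.6,
    "Pacific Islander": 1.0,
    "White": 24.5,
    "Other/Unknown": 9.6,
}

# A Pareto chart lists the bars from largest to smallest.
pareto_order = sorted(percents.items(), key=lambda item: item[1], reverse=True)

for category, percent in pareto_order:
    print(f"{category:<17}{percent:>5.1f}%")
```

With "Other/Unknown" included, the percents add to 100%, so the same sorted data could also be drawn as a pie chart.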


[Figure: bar graph titled "Ethnicity of Students".]
Bar Graph With Other/Unknown Category


[Figure: Pareto chart.]
Pareto Chart With Bars Sorted By Size


Pie Charts: No Missing Data 

The following pie charts have the "Other/Unknown" category added back in 
(since the percentages must add to 100%). The chart on the right is
organized with the wedges ordered by size, which makes for a more visually
informative graph than the unsorted, alphabetical graph on the left.


[Figure: two pie charts, each titled "Ethnicity of Students".]


Glossary 


Continuous Random Variable 
A random variable (RV) whose outcomes are measured. 


Example: 
The height of trees in the forest is a continuous RV. 


Data 
A set of observations (a set of possible outcomes). Most data can be 
put into two groups: qualitative (hair color, ethnic groups and other 
attributes of the population) and quantitative (distance traveled to 
college, number of children in a family, etc.). Quantitative data can be 
separated into two subgroups: discrete and continuous. Data is 
discrete if it is the result of counting (the number of students of a given 
ethnic group in a class, the number of books on a shelf, etc.). Data is 
continuous if it is the result of measuring (distance traveled, weight of 
luggage, etc.) 


Discrete Random Variable 
A random variable (RV) whose outcomes are counted. 


Qualitative Data 
See Data. 


Quantitative Data 
See Data. 


Sampling 

This module introduces the concept of statistical sampling. Students are 
taught the difference between a simple random sample, stratified sample, 
cluster sample, systematic sample, and convenience sample. Example 
problems are provided, including an optional classroom activity. 


Gathering information about an entire population often costs too much or is 
virtually impossible. Instead, we use a sample of the population. A sample 
should have the same characteristics as the population it is 
representing. Most statisticians use various methods of random sampling 
in an attempt to achieve this goal. This section will describe a few of the 
most common methods. 


There are several different methods of random sampling. In each form of 
random sampling, each member of a population initially has an equal 
chance of being selected for the sample. Each method has pros and cons. 
The easiest method to describe is called a simple random sample. Any 
group of n individuals is as likely to be chosen as any other group of n
individuals if the simple random sampling technique is used. In other
words, each sample of the same size has an equal chance of being selected. 
For example, suppose Lisa wants to form a four-person study group (herself 
and three other people) from her pre-calculus class, which has 31 members 
not including Lisa. To choose a simple random sample of size 3 from the 
other members of her class, Lisa could put all 31 names in a hat, shake the 
hat, close her eyes, and pick out 3 names. A more technological way is for 
Lisa to first list the last names of the members of her class together with a 
two-digit number as shown below. 


ID  Name        ID  Name        ID  Name
00  Anselmo     11  King        22  Roth
01  Bautista    12  Legeny      23  Rowell
02  Bayani      13  Lundquist   24  Salangsang
03  Cheng       14  Macierz     25  Slade
04  Cuarismo    15  Motogawa    26  Stracher
05  Cunningham  16  Okimoto     27  Tallai
06  Fontecha    17  Patel       28  Tran
07  Hong        18  Price       29  Wai
08  Hoobler     19  Quizon      30  Wood
09  Jiao        20  Reyes
10  Khan        21  Roquero
Class Roster


Lisa can either use a table of random numbers (found in many statistics 
books as well as mathematical handbooks) or a calculator or computer to 
generate random numbers. For this example, suppose Lisa chooses to 
generate random numbers from a calculator. The numbers generated are: 


.94360 .99832 .14669 .51470 .40581 .73381 .04399


Lisa reads two-digit groups until she has chosen three class members (that 
is, she reads .94360 as the groups 94, 43, 36, 60). Each random number 
may only contribute one class member. If she needed to, Lisa could have 
generated more random numbers. 


The random numbers .94360 and .99832 do not contain appropriate two-digit
numbers. However, the third random number, .14669, contains 14 (the
fourth random number also contains 14), the fifth random number contains
05, and the seventh random number contains 04. The two-digit number 14
corresponds to Macierz, 05 corresponds to Cunningham, and 04
corresponds to Cuarismo. Besides herself, Lisa's group will consist of
Macierz, Cunningham, and Cuarismo.
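

Lisa's procedure for reading two-digit groups can be written out step by step. The following is a minimal Python sketch of that procedure, not part of the original text; the function names two_digit_groups and choose_members are chosen here for illustration, and the rule that an ID already chosen is skipped is an assumption consistent with the example.

```python
ROSTER_SIZE = 31  # valid IDs run from 00 through 30


def two_digit_groups(random_number):
    """Split a string such as '.94360' into its overlapping two-digit groups: 94, 43, 36, 60."""
    digits = random_number.split(".")[1]
    return [int(digits[i:i + 2]) for i in range(len(digits) - 1)]


def choose_members(random_numbers, sample_size, roster_size=ROSTER_SIZE):
    """Read the random numbers in order; each one may contribute at most one class member."""
    chosen = []
    for number in random_numbers:
        for group in two_digit_groups(number):
            if group < roster_size and group not in chosen:
                chosen.append(group)
                break  # move on to the next random number
        if len(chosen) == sample_size:
            break
    return chosen


numbers = [".94360", ".99832", ".14669", ".51470", ".40581", ".73381", ".04399"]
print(choose_members(numbers, 3))  # [14, 5, 4]
```

The IDs 14, 05, and 04 match the class members named above (Macierz, Cunningham, and Cuarismo).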


Besides simple random sampling, there are other forms of sampling that 
involve a chance process for getting the sample. Other well-known 
random sampling methods are the stratified sample, the cluster sample, 
and the systematic sample. 


To choose a stratified sample, divide the population into groups called 
strata and then take a proportionate number from each stratum. For 
example, you could stratify (group) your college population by department 
and then choose a proportionate simple random sample from each stratum 
(each department) to get a stratified random sample. To choose a simple 
random sample from each department, number each member of the first 
department, number each member of the second department and do the 
same for the remaining departments. Then use simple random sampling to 
choose proportionate numbers from the first department and do the same for 
each of the remaining departments. Those numbers picked from the first 
department, picked from the second department and so on represent the 
members who make up the stratified sample. 
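

The stratified procedure above can be sketched in Python. This is only an illustration under stated assumptions: the department names, their sizes, and the 10% sampling fraction are hypothetical, and the function name stratified_sample is not from the text.

```python
import random


def stratified_sample(strata, fraction, seed=None):
    """Take a proportionate simple random sample from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        k = round(len(members) * fraction)  # proportionate number for this stratum
        sample[name] = rng.sample(members, k)  # simple random sample without replacement
    return sample


# Hypothetical departments (strata) with numbered members.
strata = {
    "Math": [f"M{i}" for i in range(40)],
    "English": [f"E{i}" for i in range(60)],
    "Art": [f"A{i}" for i in range(20)],
}

sample = stratified_sample(strata, fraction=0.10, seed=1)
print({name: len(chosen) for name, chosen in sample.items()})
```

Every department contributes members, in proportion to its size, which is the defining feature of a stratified sample.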


To choose a cluster sample, divide the population into clusters (groups)
and then randomly select some of the clusters. All the members from these
clusters are in the cluster sample. For example, if you randomly sample four
departments from your college population, the four departments make up
the cluster sample. To carry this out, divide your college faculty by department.
The departments are the clusters. Number each department and then choose
four different numbers using simple random sampling. All members of the
four departments with those numbers are the cluster sample.
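

A cluster sample differs from a stratified sample in that whole groups are chosen at random and every member of a chosen group is included. The following minimal Python sketch illustrates this; the department names and sizes are hypothetical, and the function name cluster_sample is not from the text.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every member of each chosen cluster."""
    rng = random.Random(seed)
    chosen_names = rng.sample(sorted(clusters), n_clusters)
    members = [member for name in chosen_names for member in clusters[name]]
    return chosen_names, members


# Hypothetical faculty lists, one per department (cluster).
departments = {
    "Art": ["A1", "A2", "A3"],
    "Biology": ["B1", "B2", "B3", "B4"],
    "Chemistry": ["C1", "C2"],
    "English": ["E1", "E2", "E3", "E4", "E5"],
    "History": ["H1", "H2", "H3"],
    "Math": ["M1", "M2", "M3", "M4"],
    "Nursing": ["N1", "N2"],
}

chosen_departments, faculty_sample = cluster_sample(departments, n_clusters=4, seed=3)
print(chosen_departments)
print(len(faculty_sample), "faculty members in the cluster sample")
```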


To choose a systematic sample, randomly select a starting point and take 
every nth piece of data from a listing of the population. For example, 
suppose you have to do a phone survey. Your phone book contains 20,000 
residence listings. You must choose 400 names for the sample. Number the 
population 1 - 20,000 and then use a simple random sample to pick a 
number that represents the first name of the sample. Then choose every 
50th name thereafter until you have a total of 400 names (you might have to
go back to the beginning of your phone list). Systematic sampling is
frequently chosen because it is a simple method.
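

The phone survey example can be sketched in Python. This is an illustration only: the function name systematic_sample is hypothetical, and wrapping around to the start of the list is handled with the remainder operator.

```python
import random


def systematic_sample(population_size, sample_size, seed=None):
    """Pick a random starting listing, then every k-th listing, wrapping past the end of the list."""
    rng = random.Random(seed)
    step = population_size // sample_size       # 20,000 / 400 = every 50th name
    start = rng.randint(1, population_size)     # listings are numbered 1 to population_size
    return [(start - 1 + i * step) % population_size + 1 for i in range(sample_size)]


listing_numbers = systematic_sample(20_000, 400, seed=7)
print(listing_numbers[:5])
print(len(listing_numbers), "listings chosen")
```

Only the starting point is random; once it is fixed, the rest of the sample is determined by the step size.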


A type of sampling that is nonrandom is convenience sampling. 
Convenience sampling involves using results that are readily available. For 
example, a computer software store conducts a marketing study by 
interviewing potential customers who happen to be in the store browsing 
through the available software. The results of convenience sampling may be
very good in some cases and highly biased (favoring certain outcomes) in
others.


Sampling data should be done very carefully. Collecting data carelessly can 
have devastating results. Surveys mailed to households and then returned 
may be very biased (for example, they may favor a certain group). It is 
better for the person conducting the survey to select the sample 
respondents. 


True random sampling is done with replacement. That is, once a member 
is picked that member goes back into the population and thus may be 
chosen more than once. However for practical reasons, in most populations, 
simple random sampling is done without replacement. Surveys are 
typically done without replacement. That is, a member of the population 
may be chosen only once. Most samples are taken from large populations 
and the sample tends to be small in comparison to the population. Since this 
is the case, sampling without replacement is approximately the same as 
sampling with replacement because the chance of picking the same
individual more than once when sampling with replacement is very low.


For example, in a college population of 10,000 people, suppose you want to 
randomly pick a sample of 1000 for a survey. For any particular sample 
of 1000, if you are sampling with replacement, 


• the chance of picking the first person is 1000 out of 10,000 (0.1000);

• the chance of picking a different second person for this sample is 999
out of 10,000 (0.0999);

• the chance of picking the same person again is 1 out of 10,000 (very
low).


If you are sampling without replacement, 


• the chance of picking the first person for any particular sample is 1000
out of 10,000 (0.1000);

• the chance of picking a different second person is 999 out of 9,999
(0.0999);

• you do not replace the first person before picking the next person.


Compare the fractions 999/10,000 and 999/9,999. For accuracy, carry the
decimal answers to four decimal places. To four decimal places, these numbers
are equivalent (0.0999).


Sampling without replacement instead of sampling with replacement only
becomes a mathematical issue when the population is small, which is not that
common. For example, if the population is 25 people, the sample is 10, and
you are sampling with replacement for any particular sample,


• the chance of picking the first person is 10 out of 25 and a different
second person is 9 out of 25 (you replace the first person).


If you sample without replacement, 


• the chance of picking the first person is 10 out of 25 and then the
second person (which is different) is 9 out of 24 (you do not replace
the first person).


Compare the fractions 9/25 and 9/24. To 4 decimal places, 9/25 = 0.3600 
and 9/24 = 0.3750. To 4 decimal places, these numbers are not equivalent. 
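

The two comparisons above use the same arithmetic with different population sizes. The following minimal Python sketch repeats the calculation; the function name second_pick_chances is chosen here for illustration.

```python
def second_pick_chances(population, sample):
    """Chance of picking a different second person for a particular sample,
    with and without replacement, rounded to four decimal places."""
    with_replacement = (sample - 1) / population
    without_replacement = (sample - 1) / (population - 1)
    return round(with_replacement, 4), round(without_replacement, 4)


print(second_pick_chances(10_000, 1000))  # (0.0999, 0.0999): large population, no visible difference
print(second_pick_chances(25, 10))        # (0.36, 0.375): small population, the difference shows
```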


When you analyze data, it is important to be aware of sampling errors and 
nonsampling errors. The actual process of sampling causes sampling errors. 
For example, the sample may not be large enough. Factors not related to the 
sampling process cause nonsampling errors. A defective counting device 
can cause a nonsampling error. 


In reality, a sample will never be exactly representative of the population so 
there will always be some sampling error. As a rule, the larger the sample, 
the smaller the sampling error. 


In statistics, a sampling bias is created when a sample is collected from a 
population and some members of the population are not as likely to be 
chosen as others (remember, each member of the population should have an 
equally likely chance of being chosen). When a sampling bias happens, 
there can be incorrect conclusions drawn about the population that is being 
studied. 


Example: 
Exercise: 


Problem: 


Determine the type of sampling used (simple random, stratified, 
systematic, cluster, or convenience). 


1. A soccer coach selects 6 players from a group of boys aged 8 to 
10, 7 players from a group of boys aged 11 to 12, and 3 players 
from a group of boys aged 13 to 14 to form a recreational soccer 
team. 

2. A pollster interviews all human resource personnel in five 
different high tech companies. 

3. A high school educational researcher interviews 50 high school 
female teachers and 50 high school male teachers. 

4. A medical researcher interviews every third cancer patient from 
a list of cancer patients at a local hospital. 


5. A high school counselor uses a computer to generate 50 random 
numbers and then picks students whose names correspond to the 
numbers. 

6. A student interviews classmates in his algebra class to determine 
how many pairs of jeans a student owns, on the average. 


Solution: 


1. stratified
2. cluster
3. stratified
4. systematic
5. simple random
6. convenience


If we were to examine two samples representing the same population, even
if we used random sampling methods for the samples, they would not be 
exactly the same. Just as there is variation in data, there is variation in 
samples. As you become accustomed to sampling, the variability will seem 
natural. 


Example: 

Suppose ABC College has 10,000 part-time students (the population). We 
are interested in the average amount of money a part-time student spends 
on books in the fall term. Asking all 10,000 students is an almost 
impossible task. 

Suppose we take two different samples. 

First, we use convenience sampling and survey 10 students from a first-term
organic chemistry class. Many of these students are taking first-term
calculus in addition to the organic chemistry class. The amount of money
they spend is as follows: 

$128 $87 $173 $116 $130 $204 $147 $189 $93 $153 


The second sample is taken by using a list from the P.E. department of 
senior citizens who take P.E. classes and taking every 5th senior citizen on 
the list, for a total of 10 senior citizens. They spend: 

$50 $40 $36 $15 $50 $100 $40 $53 $22 $22 

Exercise: 


Problem: 


Do you think that either of these samples is representative of (or is 
characteristic of) the entire 10,000 part-time student population? 


Solution: 


No. The first sample probably consists of science-oriented students. 
Besides the chemistry course, some of them are taking first-term 
calculus. Books for these classes tend to be expensive. Most of these 
students are, more than likely, paying more than the average part-time 
student for their books. The second sample is a group of senior 
citizens who are, more than likely, taking courses for health and 
interest. The amount of money they spend on books is probably much 
less than the average part-time student. Both samples are biased. Also, 
in both cases, not all students have a chance to be in either sample. 


Exercise: 
Problem: 


Since these samples are not representative of the entire population, is 
it wise to use the results to describe the entire population? 


Solution: 


No. For these samples, each member of the population did not have an 
equally likely chance of being chosen. 


Now, suppose we take a third sample. We choose ten different part-time 
students from the disciplines of chemistry, math, English, psychology, 
sociology, history, nursing, physical education, art, and early childhood 
development. (We assume that these are the only disciplines in which
part-time students at ABC College are enrolled and that an equal number of
part-time students are enrolled in each of the disciplines.) Each student is
chosen using simple random sampling. Using a calculator, random 
numbers are generated and a student from a particular discipline is selected 
if he/she has a corresponding number. The students spend: 

$180 $50 $150 $85 $260 $75 $180 $200 $200 $150 

Exercise: 


Problem: Is the sample biased? 
Solution: 


The sample is unbiased, but a larger sample would be recommended 
to increase the likelihood that the sample will be close to 
representative of the population. However, for a biased sampling 
technique, even a large sample runs the risk of not being 
representative of the population. 
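

The difference among the three samples shows up clearly in their means. The following minimal Python sketch computes the average amount spent in each sample from the dollar amounts listed in the example; the variable and function names are chosen here for illustration.

```python
def sample_mean(values):
    """Arithmetic mean of a list of dollar amounts."""
    return sum(values) / len(values)


chemistry_class = [128, 87, 173, 116, 130, 204, 147, 189, 93, 153]   # convenience sample
pe_seniors = [50, 40, 36, 15, 50, 100, 40, 53, 22, 22]               # every 5th senior citizen
ten_disciplines = [180, 50, 150, 85, 260, 75, 180, 200, 200, 150]    # one student per discipline

print(sample_mean(chemistry_class))  # 142.0
print(sample_mean(pe_seniors))       # 42.8
print(sample_mean(ten_disciplines))  # 153.0
```

The first two means differ by almost $100, which reflects how each sample was chosen rather than anything about the part-time student population as a whole.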


Students often ask if it is "good enough" to take a sample, instead of 
surveying the entire population. If the survey is done well, the answer is 
yes. 


Optional Collaborative Classroom Exercise 


Exercise: 


Problem: 


As a class, determine whether or not the following samples are 
representative. If they are not, discuss the reasons. 


1. To find the average GPA of all students in a university, use all 
honor students at the university as the sample. 

2. To find out the most popular cereal among young people under 
the age of 10, stand outside a large supermarket for three hours 
and speak to every 20th child under age 10 who enters the 
supermarket. 


3. To find the average annual income of all adults in the United 
States, sample U.S. congressmen. Create a cluster sample by 
considering each state as a stratum (group). By using simple 
random sampling, select states to be part of the cluster. Then 
survey every U.S. congressman in the cluster. 

4. To determine the proportion of people taking public transportation 
to work, survey 20 people in New York City. Conduct the survey 
by sitting in Central Park on a bench and interviewing every 
person who sits next to you. 

5. To determine the average cost of a two day stay in a hospital in 
Massachusetts, survey 100 hospitals across the state using simple 
random sampling. 


Sampling and Data: More on Sampling (edited: Teegarden) 


Note: This module is a DRAFT.


If we were to examine two samples representing the same population, they 
would, more than likely, not be the same. Just as there is variation in data, 
there is variation in samples. As you become accustomed to sampling, the 
variability will seem natural. 


Example: 

Suppose ABC College has 10,000 part-time students (the population). We 
are interested in the average amount of money a part-time student spends 
on books in the fall term. Asking all 10,000 students is an almost 
impossible task. 

Suppose we take two different samples. 

First, we use convenience sampling and survey 10 students from a first-term
organic chemistry class. Many of these students are taking first-term
calculus in addition to the organic chemistry class. The amount of money
they spend is as follows: 

$128 $87 $173 $116 $130 $204 $147 $189 $93 $153 

The second sample is taken by using a list from the P.E. department of 
senior citizens who take P.E. classes and taking every 5th senior citizen on 
the list, for a total of 10 senior citizens. They spend: 

$50 $40 $36 $15 $50 $100 $40 $53 $22 $22 

Exercise: 


Problem: 


Do you think that either of these samples is representative of (or is 
characteristic of) the entire 10,000 part-time student population? 


Solution: 


No. The first sample probably consists of science-oriented students. 
Besides the chemistry course, some of them are taking first-term
calculus. Books for these classes tend to be expensive. Most of these 
students are, more than likely, paying more than the average part-time 
student for their books. The second sample is a group of senior 
citizens who are, more than likely, taking courses for health and 
interest. The amount of money they spend on books is probably much 
less than the average part-time student. Both samples are biased. Also, 
in both cases, not all students have a chance to be in either sample. 


Exercise: 
Problem: 


Since these samples are not representative of the entire population, is 
it wise to use the results to describe the entire population? 


Solution: 


No. Never use a sample that is not representative or does not have the 
characteristics of the population. 


Now, suppose we take a third sample. We choose ten different part-time 
students from the disciplines of chemistry, math, English, psychology, 
sociology, history, nursing, physical education, art, and early childhood 
development. Each student is chosen using simple random sampling. Using 
a calculator, random numbers are generated and a student from a particular 
discipline is selected if he/she has a corresponding number. The students 
spend: 

$180 $50 $150 $85 $260 $75 $180 $200 $200 $150 

Exercise: 


Problem: 
Do you think this sample is representative of the population? 
Solution: 


Yes. It is chosen from different disciplines across the population. 


Students often ask if it is "good enough" to take a sample, instead of 
surveying the entire population. If the survey is done well, the answer is 
yes. 


Exercise: 


Problem: 


As a class, determine whether or not the following samples are 
representative. If they are not, discuss the reasons. 


1. To find the average GPA of all students in a university, use all 
honor students at the university as the sample. 

2. To find out the most popular cereal among young people under 
the age of 10, stand outside a large supermarket for three hours 
and speak to every 20th child under age 10 who enters the 
supermarket. 

3. To find the average annual income of all adults in the United 
States, sample U.S. congressmen. Create a cluster sample by 
considering each state as a stratum (group). By using simple 
random sampling, select states to be part of the cluster. Then 
survey every U.S. congressman in the cluster. 

4. To determine the proportion of people taking public transportation 
to work, survey 20 people in New York City. Conduct the survey 
by sitting in Central Park on a bench and interviewing every 
person who sits next to you. 

5. To determine the average cost of a two day stay in a hospital in 
Massachusetts, survey 100 hospitals across the state using simple 
random sampling. 


Variation 

This module discusses statistical variability within data and samples. 
Students will be given the opportunity to see this variability in action 
through participation in an optional classroom exercise. This module also 
has a section that discusses Critical Evaluation. 


Variation in Data 


Variation is present in any set of data. For example, 16-ounce cans of 
beverage may contain more or less than 16 ounces of liquid. In one study,
eight 16-ounce cans were measured and produced the following amounts (in
ounces) of beverage:


15.6 16.1 15.2 14.8 15.8 15.9 16.0 15.5


Measurements of the amount of beverage in a 16-ounce can may vary 
because different people make the measurements or because the exact 
amount, 16 ounces of liquid, was not put into the cans. Manufacturers 
regularly run tests to determine if the amount of beverage in a 16-ounce can 
falls within the desired range. 
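

A quick summary of the eight measurements makes the variation concrete. The following minimal Python sketch reports the smallest, largest, and average amounts; the variable names are chosen here for illustration.

```python
# Amounts (in ounces) measured in the eight 16-ounce cans.
cans = [15.6, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5]

smallest = min(cans)
largest = max(cans)
mean_amount = sum(cans) / len(cans)
spread = largest - smallest  # range of the measurements

print(f"smallest: {smallest} oz, largest: {largest} oz")
print(f"mean: {mean_amount:.4f} oz, range: {spread:.1f} oz")
```

Only one of the eight cans holds exactly 16.0 ounces, even though all are labeled as 16-ounce cans.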


Be aware that as you take data, your data may vary somewhat from the data 
someone else is taking for the same purpose. This is completely natural. 
However, if two or more of you are taking the same data and get very 
different results, it is time for you and the others to reevaluate your data- 
taking methods and your accuracy. 


Variation in Samples 


It was mentioned previously that two or more samples from the same
population, even when taken randomly and having close to the same
characteristics as the population, are different from each other. Suppose
Doreen and Jung both decide to study the average amount of time students
at their college sleep each night. Doreen and Jung each take samples of 500
students. Doreen uses systematic sampling and Jung uses cluster sampling.
Doreen's sample will be different from Jung's sample. Even if Doreen and
Jung used the same sampling method, in all likelihood their samples would
be different. Neither would be wrong, however.


Think about what contributes to making Doreen's and Jung's samples 
different. 


If Doreen and Jung took larger samples (i.e., the number of data values is
increased), their sample results (the average amount of time a student
sleeps) might be closer to the actual population average. But still, their
samples would be, in all likelihood, different from each other. This
variability in samples cannot be stressed enough.


Size of a Sample 


The size of a sample (often called the number of observations) is important. 
The examples you have seen in this book so far have been small. Samples 
of only a few hundred observations, or even smaller, are sufficient for many 
purposes. In polling, samples that are from 1200 to 1500 observations are 
considered large enough and good enough if the survey is random and is 
well done. You will learn why when you study confidence intervals. 


Be aware that many large samples are biased. For example, call-in surveys
are invariably biased because people choose to respond or not.


Optional Collaborative Classroom Exercise 


Exercise: 


Problem: 


Divide into groups of two, three, or four. Your instructor will give each 
group one 6-sided die. Try this experiment twice. Roll one fair die (6- 
sided) 20 times. Record the number of ones, twos, threes, fours, fives, 
and sixes you get below ("frequency" is the number of times a 
particular face of the die occurs): 


Face on Die    Frequency
1
2
3
4
5
6


First Experiment (20 rolls)


Face on Die    Frequency
1
2
3
4
5
6


Second Experiment (20 rolls)


Did the two experiments have the same results? Probably not. If you 
did the experiment a third time, do you expect the results to be 
identical to the first or second experiment? (Answer yes or no.) Why 
or why not? 


Which experiment had the correct results? They both did. The job of 
the statistician is to see through the variability and draw appropriate 
conclusions. 
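
The same experiment can be tried on a computer. The sketch below is only an illustration and not part of the original exercise: the function name roll_experiment and the seed values are assumptions chosen for the example, and Python's random module stands in for a physical die.

```python
import random
from collections import Counter

def roll_experiment(n_rolls, seed):
    """Roll a fair six-sided die n_rolls times and count each face."""
    rng = random.Random(seed)  # seeded only so that a run can be repeated
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    counts = Counter(rolls)
    # Report every face, including faces that never came up.
    return {face: counts.get(face, 0) for face in range(1, 7)}

first = roll_experiment(20, seed=1)   # hypothetical first experiment
second = roll_experiment(20, seed=2)  # hypothetical second experiment

print("Face  First  Second")
for face in range(1, 7):
    print(f"{face:>4}  {first[face]:>5}  {second[face]:>6}")
```

Changing the seeds plays the role of rolling again. Each run produces a frequency column that sums to 20, yet the individual frequencies usually differ from run to run.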


Critical Evaluation 


We need to critically evaluate the statistical studies we read about and 
analyze before accepting the results of the study. Common problems to be 
aware of include 


• Problems with Samples: A sample should be representative of the
  population. A sample that is not representative of the population is
  biased. Biased samples give results that are inaccurate and not valid.

• Self-Selected Samples: Responses only by people who choose to
  respond, such as call-in surveys, are often unreliable.

• Sample Size Issues: Samples that are too small may be unreliable.
  Larger samples are better if possible. In some situations, small samples
  are unavoidable and can still be used to draw conclusions. Examples:
  crash testing cars, medical testing for rare conditions.

• Undue Influence: Collecting data or asking questions in a way that
  influences the response.

• Non-response or Refusal of Subject to Participate: The collected
  responses may no longer be representative of the population. Often,
  people with strong positive or negative opinions may answer surveys,
  which can affect the results.

• Causality: A relationship between two variables does not mean that
  one causes the other to occur. They may both be related (correlated)
  because of their relationship through a different variable.

• Self-Funded or Self-Interest Studies: A study performed by a person or
  organization in order to support their claim. Is the study impartial?
  Read the study carefully to evaluate the work. Do not automatically
  assume that the study is good, but do not automatically assume the
  study is bad either. Evaluate it on its merits and the work done.

• Misleading Use of Data: Improperly displayed graphs, incomplete
  data, lack of context.

• Confounding: When the effects of multiple factors on a response
  cannot be separated. Confounding makes it difficult or impossible to
  draw valid conclusions about the effect of each factor.


Glossary 


Population 
The collection, or set, of all individuals, objects, or measurements 
whose properties are being studied. 


Sample 
A portion of the population under study. A sample is representative if it
characterizes the population being studied.


Answers and Rounding Off 
This module briefly explains the correct way to round off answers when 
working with statistical data. 


A simple way to round off answers is to carry your final answer one more 
decimal place than was present in the original data. Round only the final 
answer. Do not round any intermediate results, if possible. If it becomes 
necessary to round intermediate results, carry them to at least twice as many 
decimal places as the final answer. For example, the average of the three 
quiz scores 4, 6, 9 is 6.3, rounded to the nearest tenth, because the data are 
whole numbers. Most answers will be rounded in this manner. 
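
The quiz-score example can be checked with a short sketch. The variable names are assumptions made for illustration; the point is that the intermediate result is kept exact and only the final answer is rounded.

```python
from fractions import Fraction

scores = [4, 6, 9]  # whole numbers, so the data have zero decimal places

# Keep the intermediate result exact instead of rounding it early.
mean_exact = Fraction(sum(scores), len(scores))  # 19/3

# Round only the final answer, to one more decimal place than the data.
final_answer = round(float(mean_exact), 1)
print(final_answer)  # 6.3
```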


It is not necessary to reduce most fractions in this course. Especially in 
Probability Topics, the chapter on probability, it is more helpful to leave an 
answer as an unreduced fraction. 


Frequency 

This module introduces the concepts of frequency, relative frequency, and 
cumulative relative frequency, and the relationship between these measures. 
Students will have the opportunity to interpret data through the sample problems 
provided. 


Twenty students were asked how many hours they worked per day. Their 
responses, in hours, are listed below: 


5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3


Below is a frequency table listing the different data values in ascending order and 
their frequencies. 


DATA VALUE    FREQUENCY
2             3
3             5
4             3
5             6
6             2
7             1


Frequency Table of Student Work Hours 


A frequency is the number of times a given datum occurs in a data set. 
According to the table above, there are three students who work 2 hours, five 
students who work 3 hours, etc. The total of the frequency column, 20, represents 
the total number of students included in the sample. 


A relative frequency is the fraction or proportion of times an answer occurs. To 
find the relative frequencies, divide each frequency by the total number of 
students in the sample - in this case, 20. Relative frequencies can be written as 
fractions, percents, or decimals. 


DATA VALUE    FREQUENCY    RELATIVE FREQUENCY
2             3            3/20 or 0.15
3             5            5/20 or 0.25
4             3            3/20 or 0.15
5             6            6/20 or 0.30
6             2            2/20 or 0.10
7             1            1/20 or 0.05


Frequency Table of Student Work Hours w/ Relative Frequency 


The sum of the relative frequency column is 20/20, or 1.


Cumulative relative frequency is the accumulation of the previous relative 
frequencies. To find the cumulative relative frequencies, add all the previous 
relative frequencies to the relative frequency for the current row. 


DATA VALUE    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
2             3            3/20 or 0.15          0.15
3             5            5/20 or 0.25          0.15 + 0.25 = 0.40
4             3            3/20 or 0.15          0.40 + 0.15 = 0.55
5             6            6/20 or 0.30          0.55 + 0.30 = 0.85
6             2            2/20 or 0.10          0.85 + 0.10 = 0.95
7             1            1/20 or 0.05          0.95 + 0.05 = 1.00


Frequency Table of Student Work Hours w/ Relative and Cumulative Relative 
Frequency 


The last entry of the cumulative relative frequency column is one, indicating that 
one hundred percent of the data has been accumulated. 


Note: Because of rounding, the relative frequency column may not always sum
to one and the last entry in the cumulative relative frequency column may not be 
one. However, they each should be close to one. 
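
The three columns can also be built programmatically. The following is a minimal sketch, assuming the twenty work-hour responses whose counts match the frequency table; the function name frequency_table is hypothetical. Exact fractions are used so that the cumulative column ends at exactly one.

```python
from collections import Counter
from fractions import Fraction

# The twenty responses (hours worked per day) from the example.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

def frequency_table(data):
    """Return rows of (value, frequency, relative freq., cumulative relative freq.)."""
    counts = Counter(data)
    n = len(data)
    rows = []
    cumulative = Fraction(0)
    for value in sorted(counts):           # ascending order of data values
        relative = Fraction(counts[value], n)
        cumulative += relative             # add all previous relative frequencies
        rows.append((value, counts[value], relative, cumulative))
    return rows

table = frequency_table(hours)
for value, freq, rel, cum in table:
    print(f"{value:>5} {freq:>9} {float(rel):>8.2f} {float(cum):>8.2f}")
```

Working with fractions delays rounding until the values are printed, which follows the rounding advice given earlier in the chapter.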


The following table represents the heights, in inches, of a sample of 100 male 
semiprofessional soccer players. 


HEIGHTS (INCHES)    FREQUENCY      RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
59.95 - 61.95       5              5/100 = 0.05          0.05
61.95 - 63.95       3              3/100 = 0.03          0.05 + 0.03 = 0.08
63.95 - 65.95       15             15/100 = 0.15         0.08 + 0.15 = 0.23
65.95 - 67.95       40             40/100 = 0.40         0.23 + 0.40 = 0.63
67.95 - 69.95       17             17/100 = 0.17         0.63 + 0.17 = 0.80
69.95 - 71.95       12             12/100 = 0.12         0.80 + 0.12 = 0.92
71.95 - 73.95       7              7/100 = 0.07          0.92 + 0.07 = 0.99
73.95 - 75.95       1              1/100 = 0.01          0.99 + 0.01 = 1.00
                    Total = 100    Total = 1.00


Frequency Table of Soccer Player Height


The data in this table has been grouped into the following intervals:


• 59.95 - 61.95 inches
• 61.95 - 63.95 inches
• 63.95 - 65.95 inches
• 65.95 - 67.95 inches
• 67.95 - 69.95 inches
• 69.95 - 71.95 inches
• 71.95 - 73.95 inches
• 73.95 - 75.95 inches


Note: This example is used again in the Descriptive Statistics chapter, where the 
method used to compute the intervals will be explained. 


In this sample, there are 5 players whose heights are between 59.95 - 61.95 
inches, 3 players whose heights fall within the interval 61.95 - 63.95 inches, 15 
players whose heights fall within the interval 63.95 - 65.95 inches, 40 players 
whose heights fall within the interval 65.95 - 67.95 inches, 17 players whose 
heights fall within the interval 67.95 - 69.95 inches, 12 players whose heights fall 
within the interval 69.95 - 71.95, 7 players whose height falls within the interval 
71.95 - 73.95, and 1 player whose height falls within the interval 73.95 - 75.95. 
All heights fall between the endpoints of an interval and not at the endpoints. 


Example: 
Exercise: 


Problem: 


From the table, find the percentage of heights that are less than 65.95 
inches. 


Solution: 


If you look at the first, second, and third rows, the heights are all less than
65.95 inches. There are 5 + 3 + 15 = 23 males whose heights are less than
65.95 inches. The percentage of heights less than 65.95 inches is then 23/100,
or 23%. This percentage is the cumulative relative frequency entry in the
third row.


Example: 


Exercise: 


Problem: 


From the table, find the percentage of heights that fall between 61.95 and 
65.95 inches. 


Solution: 


Add the relative frequencies in the second and third rows: 0.03 + 0.15 = 
0.18 or 18%. 
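
The two worked examples above can be reproduced with a short sketch. The function name percent_between is an assumption made for illustration, and the class boundaries are stored as decimals so that comparisons at 65.95, 67.95, and so on are exact.

```python
from decimal import Decimal

# Class boundaries 59.95, 61.95, ..., 75.95 and the frequencies from the table.
bounds = [Decimal("59.95") + 2 * i for i in range(9)]
frequencies = [5, 3, 15, 40, 17, 12, 7, 1]
total = sum(frequencies)  # 100 players

def percent_between(low, high):
    """Percent of heights in the classes that lie entirely between low and high."""
    low, high = Decimal(low), Decimal(high)
    count = sum(
        freq
        for lower, upper, freq in zip(bounds, bounds[1:], frequencies)
        if lower >= low and upper <= high
    )
    return 100 * count / total

print(percent_between("59.95", "65.95"))  # heights less than 65.95 inches
print(percent_between("61.95", "65.95"))  # heights between 61.95 and 65.95 inches
```

The function only accepts class boundaries as endpoints, because a grouped table does not record where heights fall inside a class.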


Example: 
Exercise: 


Problem: 


Use the table of heights of the 100 male semiprofessional soccer players. 
Fill in the blanks and check your answers. 


1. The percentage of heights that are from 67.95 to 71.95 inches is:

2. The percentage of heights that are from 67.95 to 73.95 inches is:

3. The percentage of heights that are more than 65.95 inches is:

4. The number of players in the sample who are between 61.95 and 71.95
inches tall is:

5. What kind of data are the heights?

6. Describe how you could gather this data (the heights) so that the data
are characteristic of all male semiprofessional soccer players.


Remember, you count frequencies. To find the relative frequency, divide 
the frequency by the total number of data values. To find the cumulative 
relative frequency, add all of the previous relative frequencies to the 
relative frequency for the current row. 


Solution: 


1. 29%

2. 36%

3. 77%

4. 87

5. quantitative continuous

6. get rosters from each team and choose a simple random sample from
each


Optional Collaborative Classroom Exercise 


Exercise: 


Problem: 


In your class, have someone conduct a survey of the number of siblings 
(brothers and sisters) each student has. Create a frequency table. Add to it a 
relative frequency column and a cumulative relative frequency column. 
Answer the following questions: 


1. What percentage of the students in your class has 0 siblings? 
2. What percentage of the students has from 1 to 3 siblings? 
3. What percentage of the students has fewer than 3 siblings? 


Example: 

Nineteen people were asked how many miles, to the nearest mile, they commute
to work each day. The data are as follows:

2, 5, 7, 3, 2, 10, 18, 15, 20, 7, 10, 18, 5, 12, 13, 12, 4, 5, 10

The following table was produced: 


DATA    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
3       3            3/19                  0.1579
4       1            1/19                  0.2105
5       3            3/19                  0.1579
7       2            2/19                  0.2632
10      3            4/19                  0.4737
12      2            2/19                  0.7895
13      1            1/19                  0.8421
15      1            1/19                  0.8948
18      1            1/19                  0.9474
20      1            1/19                  1.0000


Frequency of Commuting Distances 


Exercise: 


Problem: 


1. Is the table correct? If it is not correct, what is wrong? 

2. True or False: Three percent of the people surveyed commute 3 miles. 
If the statement is not correct, what should it be? If the table is 
incorrect, make the corrections. 

3. What fraction of the people surveyed commute 5 or 7 miles? 

4. What fraction of the people surveyed commute 12 miles or more? 
Less than 12 miles? Between 5 and 13 miles (does not include 5 and 
13 miles)? 


Solution: 


1. No. Frequency column sums to 18, not 19. Not all cumulative relative 
frequencies are correct. 

2. False. Frequency for 3 miles should be 1; for 2 miles (left out), 2. 
Cumulative relative frequency column should read: 0.1052, 0.1579, 
0.2105, 0.3684, 0.4737, 0.6316, 0.7368, 0.7895, 0.8421, 0.9474, 1. 


3. 5/19

4. 7/19, 12/19, 7/19

Glossary 
Frequency 


The number of times a value of the data occurs. 


Relative Frequency 


The ratio of the number of times a value of the data occurs in the set of all 
outcomes to the number of all outcomes. 


Cumulative Relative Frequency 


The term applies to an ordered set of observations from smallest to largest. 
The Cumulative Relative Frequency is the sum of the relative frequencies 
for all values that are less than or equal to the given value. 


Sampling and Data: Frequency, Relative Frequency, and Cumulative Frequency 
(edited: Teegarden) 


Twenty students were asked how many hours they worked per day. Their 
responses, in hours, are listed below: 


5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3


Below is a frequency table listing the different data values in ascending order and 
their frequencies. 


DATA VALUE    FREQUENCY
2             3
3             5
4             3
5             6
6             2
7             1


Frequency Table of Student Work Hours 


A frequency is the number of times a given datum occurs in a data set. 
According to the table above, there are three students who work 2 hours, five 
students who work 3 hours, etc. The total of the frequency column, 20, represents 
the total number of students included in the sample. 


A relative frequency is the fraction of times an answer occurs. To find the
relative frequencies, divide each frequency by the total number of students in the 
sample - in this case, 20. Relative frequencies can be written as fractions, 
percents, or decimals. 


DATA VALUE    FREQUENCY    RELATIVE FREQUENCY
2             3            3/20 or 0.15
3             5            5/20 or 0.25
4             3            3/20 or 0.15
5             6            6/20 or 0.30
6             2            2/20 or 0.10
7             1            1/20 or 0.05


Frequency Table of Student Work Hours w/ Relative Frequency


The sum of the relative frequency column is 20/20, or 1.


Cumulative relative frequency is the accumulation of the previous relative 
frequencies. To find the cumulative relative frequencies, add all the previous 
relative frequencies to the relative frequency for the current row. 


DATA VALUE    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
2             3            3/20 or 0.15          0.15
3             5            5/20 or 0.25          0.15 + 0.25 = 0.40
4             3            3/20 or 0.15          0.40 + 0.15 = 0.55
5             6            6/20 or 0.30          0.55 + 0.30 = 0.85
6             2            2/20 or 0.10          0.85 + 0.10 = 0.95
7             1            1/20 or 0.05          0.95 + 0.05 = 1.00


Frequency Table of Student Work Hours w/ Relative and Cumulative Frequency 


The last entry of the cumulative relative frequency column is one, indicating that
one hundred percent of the data has been accumulated.


Note: Because of rounding, the relative frequency column may not always sum
to one and the last entry in the cumulative relative frequency column may not be
one. However, they each should be close to one.
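
A small sketch shows how rounding can keep the column from summing to exactly one. The three frequencies are hypothetical values chosen only to make the effect visible; they do not come from the tables in this module.

```python
from fractions import Fraction

frequencies = [1, 1, 1]  # hypothetical data: three values, each occurring once
n = sum(frequencies)

exact = [Fraction(f, n) for f in frequencies]  # 1/3 each
rounded = [round(float(r), 2) for r in exact]  # 0.33 each

total_exact = sum(exact)                 # exactly 1
total_rounded = round(sum(rounded), 2)   # 0.99, close to but not equal to 1
print(total_exact, total_rounded)
```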


The following table represents the heights, in inches, of a sample of 100 male
semiprofessional soccer players.


HEIGHTS (INCHES)    FREQUENCY OF STUDENTS    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
59.95 - 61.95       5                        5/100 = 0.05          0.05
61.95 - 63.95       3                        3/100 = 0.03          0.05 + 0.03 = 0.08
63.95 - 65.95       15                       15/100 = 0.15         0.08 + 0.15 = 0.23
65.95 - 67.95       40                       40/100 = 0.40         0.23 + 0.40 = 0.63
67.95 - 69.95       17                       17/100 = 0.17         0.63 + 0.17 = 0.80
69.95 - 71.95       12                       12/100 = 0.12         0.80 + 0.12 = 0.92
71.95 - 73.95       7                        7/100 = 0.07          0.92 + 0.07 = 0.99
73.95 - 75.95       1                        1/100 = 0.01          0.99 + 0.01 = 1.00
                    Total = 100              Total = 1.00


Frequency Table of Soccer Player Height


The data in this table has been grouped into the following intervals: 


59.95 - 61.95 inches 
61.95 - 63.95 inches 
63.95 - 65.95 inches 
65.95 - 67.95 inches 
67.95 - 69.95 inches 
69.95 - 71.95 inches 
71.95 - 73.95 inches 
73.95 - 75.95 inches 


Note: This example is used again in the Descriptive Statistics chapter, where the 
method used to compute the intervals will be explained. 


In this sample, there are 5 players whose heights are between 59.95 - 61.95 
inches, 3 players whose heights fall within the interval 61.95 - 63.95 inches, 15 
players whose heights fall within the interval 63.95 - 65.95 inches, 40 players 
whose heights fall within the interval 65.95 - 67.95 inches, 17 players whose 
heights fall within the interval 67.95 - 69.95 inches, 12 players whose heights fall 
within the interval 69.95 - 71.95, 7 players whose height falls within the interval 
71.95 - 73.95, and 1 player whose height falls within the interval 73.95 - 75.95. 
All heights fall between the endpoints of an interval and not at the endpoints. 


Example: 
Exercise: 


Problem: 


From the table, find the percentage of heights that are less than 65.95 
inches. 


Solution: 


If you look at the first, second, and third rows, the heights are all less than
65.95 inches. There are 5 + 3 + 15 = 23 males whose heights are less than
65.95 inches. The percentage of heights less than 65.95 inches is then 23/100,
or 23%. This percentage is the cumulative relative frequency entry in the
third row.


Example: 
Exercise: 


Problem: 


From the table, find the percentage of heights that fall between 61.95 and 
65.95 inches. 


Solution: 


Add the relative frequencies in the second and third rows: 0.03 + 0.15 = 
0.18 or 18%. 


Example: 
Exercise: 


Problem: 


Use the table of heights of the 100 male semiprofessional soccer players. 
Fill in the blanks and check your answers. 


1. The percentage of heights that are from 67.95 to 71.95 inches is:

2. The percentage of heights that are from 67.95 to 73.95 inches is:

3. The percentage of heights that are more than 65.95 inches is:

4. The number of players in the sample who are between 61.95 and 71.95
inches tall is:

5. What kind of data are the heights?

6. Describe how you could gather this data (the heights) so that the data
are characteristic of all male semiprofessional soccer players.


Remember, you count frequencies. To find the relative frequency, divide 
the frequency by the total number of data values. To find the cumulative 
relative frequency, add all of the previous relative frequencies to the 
relative frequency for the current row. 


Solution: 


1. 29%

2. 36%

3. 77%

4. 87

5. quantitative continuous

6. get rosters from each team and choose a simple random sample from
each


Example: 

Nineteen people were asked how many miles, to the nearest mile, they commute
to work each day. The data are as follows:

2, 5, 7, 3, 2, 10, 18, 15, 20, 7, 10, 18, 5, 12, 13, 12, 4, 5, 10

The following table was produced: 


DATA    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
3       3            3/19                  0.1579
4       1            1/19                  0.2105
5       3            3/19                  0.1579
7       2            2/19                  0.2632
10      3            4/19                  0.4737
12      2            2/19                  0.7895
13      1            1/19                  0.8421
15      1            1/19                  0.8948
18      1            1/19                  0.9474
20      1            1/19                  1.0000


Frequency of Commuting Distances 


Exercise: 


Problem: 


1. Is the table correct? If it is not correct, what is wrong? 

2. True or False: Three percent of the people surveyed commute 3 miles. 
If the statement is not correct, what should it be? If the table is 
incorrect, make the corrections. 

3. What fraction of the people surveyed commute 5 or 7 miles? 

4. What fraction of the people surveyed commute 12 miles or more? 
Less than 12 miles? Between 5 and 13 miles (does not include 5 and 
13 miles)? 


Solution: 


1. No. Frequency column sums to 18, not 19. Not all cumulative relative 
frequencies are correct. 

2. False. Frequency for 3 miles should be 1; for 2 miles (left out), 2. 
Cumulative relative frequency column should read: 0.1052, 0.1579, 
0.2105, 0.3684, 0.4737, 0.6316, 0.7368, 0.7895, 0.8421, 0.9474, 1. 


3. 5/19

4. 7/19, 12/19, 7/19
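
The corrected cumulative column in the solution also determines the corrected frequencies: each frequency is the difference between consecutive cumulative relative frequencies, multiplied by the number of people surveyed. The sketch below illustrates this; the variable names are assumptions made for the example.

```python
# Corrected cumulative relative frequencies, as listed in the solution,
# for 2, 3, 4, 5, 7, 10, 12, 13, 15, 18, and 20 miles.
cumulative = [0.1052, 0.1579, 0.2105, 0.3684, 0.4737, 0.6316,
              0.7368, 0.7895, 0.8421, 0.9474, 1.0]
n = 19  # number of people surveyed

frequencies = []
previous = 0.0
for value in cumulative:
    # The jump in the cumulative column is this row's relative frequency.
    # Rounding absorbs the four-decimal precision of the printed values.
    frequencies.append(round((value - previous) * n))
    previous = value

print(frequencies)
print(sum(frequencies))
```

Unlike the frequency column in the original table, the recovered frequencies sum to 19.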

Glossary 

Frequency 


The number of times a value of the data occurs in the set of all data.


Relative Frequency
The ratio of the number of times a value of the data occurs in the set of all
outcomes to the number of all outcomes.


Cumulative Relative Frequency
The concept applies to an ordered set of observations from smallest to
largest, or vice versa. Cumulative relative frequency is the sum of relative
frequencies for all values that are less than or equal to the given value.


Summary 

This module provides an outline/review of key concepts related to statistical 
sampling and data. 

Statistics 


• Deals with the collection, analysis, interpretation, and presentation of
  data


Probability

• Mathematical tool used to study randomness


Key Terms

• Population
• Parameter
• Sample
• Statistic
• Variable
• Data


Types of Data

• Quantitative Data (a number)
    ◦ Discrete (You count it.)
    ◦ Continuous (You measure it.)
• Qualitative Data (a category, words)


Sampling

• With Replacement: A member of the population may be chosen more
  than once
• Without Replacement: A member of the population may be chosen
  only once


Random Sampling

• Each member of the population has an equal chance of being selected


Sampling Methods

• Random
    ◦ Simple random sample
    ◦ Stratified sample
    ◦ Cluster sample
    ◦ Systematic sample
• Not Random
    ◦ Convenience sample


Frequency (freq. or f)

• The number of times an answer occurs


Relative Frequency (rel. freq. or RF)

• The proportion of times an answer occurs
• Can be interpreted as a fraction, decimal, or percent


Cumulative Relative Frequencies (cum. rel. freq. or cum RF)

• An accumulation of the previous relative frequencies


Practice: Sampling and Data 

This module provides an opportunity for students to practice concepts 
related to statistical sampling and data. Given a sample data set, the student 
will practice constructing frequency tables, differentiating between key 
terms, and comparing sampling techniques. 


Student Learning Outcomes 


e The student will construct frequency tables. 
e The student will differentiate between key terms. 
e The student will compare sampling techniques. 


Given 


Studies are often done by pharmaceutical companies to determine the 
effectiveness of a treatment program. Suppose that a new AIDS antibody 
drug is currently under study. It is given to patients once the AIDS 
symptoms have revealed themselves. Of interest is the average (mean)
length of time in months patients live once starting the treatment. Two 
researchers each follow a different set of 40 AIDS patients from the start of 
treatment until their deaths. The following data (in months) are collected. 


Researcher A: 3, 4, 11, 15, 16, 17, 22, 44, 37, 16, 14, 24, 25, 15, 26, 27, 33,
29, 35, 44, 13, 21, 22, 10, 12, 8, 40, 32, 26, 27, 31, 34, 29, 17, 8, 24, 18, 47,
33, 34


Researcher B: 3, 14, 11, 5, 16, 17, 28, 41, 31, 18, 14, 14, 26, 25, 21, 22, 31,
2, 35, 44, 23, 21, 21, 16, 12, 18, 41, 22, 16, 25, 33, 34, 29, 13, 18, 24, 23, 42,
33, 29


Organize the Data 


Complete the tables below using the data provided. 


Survival Length                  Relative     Cumulative Relative
(in months)       Frequency      Frequency    Frequency
0.5 - 6.5
6.5 - 12.5
12.5 - 18.5
18.5 - 24.5
24.5 - 30.5
30.5 - 36.5
36.5 - 42.5
42.5 - 48.5


Researcher A


Survival Length                  Relative     Cumulative Relative
(in months)       Frequency      Frequency    Frequency
0.5 - 6.5
6.5 - 12.5
12.5 - 18.5
18.5 - 24.5
24.5 - 30.5
30.5 - 36.5
36.5 - 42.5
42.5 - 48.5


Researcher B
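
One way to check a completed frequency column is to count the data by class interval. The sketch below does this for Researcher A's 40 values; the function name class_frequencies is an assumption made for illustration, and the same function could be applied to Researcher B's values.

```python
# Researcher A's survival times in months.
researcher_a = [3, 4, 11, 15, 16, 17, 22, 44, 37, 16, 14, 24, 25, 15, 26, 27,
                33, 29, 35, 44, 13, 21, 22, 10, 12, 8, 40, 32, 26, 27, 31, 34,
                29, 17, 8, 24, 18, 47, 33, 34]

# Class boundaries 0.5, 6.5, ..., 48.5 (a width of 6 months per class).
edges = [0.5 + 6 * i for i in range(9)]

def class_frequencies(data, edges):
    """Count how many data values fall strictly between consecutive edges."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            if edges[i] < x < edges[i + 1]:
                counts[i] += 1
                break
    return counts

freq_a = class_frequencies(researcher_a, edges)
n = len(researcher_a)

running = 0
for i, f in enumerate(freq_a):
    running += f
    print(f"{edges[i]:>4} - {edges[i + 1]:<4}  {f:>2}  {f / n:.3f}  {running / n:.3f}")
```

Because the boundaries end in .5 and the data are whole numbers of months, no value can land on a boundary, so every value is counted exactly once.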


Key Terms 


Define the key terms based upon the above example for Researcher A. 
Exercise: 


Problem: Population 


Exercise: 


Problem: Sample 


Exercise: 


Problem: Parameter 


Exercise: 


Problem: Statistic 


Exercise: 


Problem: Variable 


Exercise: 


Problem: Data 


Discussion Questions 


Discuss the following questions and then answer in complete sentences. 
Exercise: 


Problem: List two reasons why the data may differ. 
Exercise: 
Problem: 
Can you tell if one researcher is correct and the other one is incorrect? 
Why? 


Exercise: 


Problem: Would you expect the data to be identical? Why or why not? 


Exercise: 


Problem: How could the researchers gather random data? 
Exercise: 
Problem: 
Suppose that the first researcher conducted his survey by randomly
choosing one state in the nation and then randomly picking 40 patients
from that state. What sampling method would that researcher have
used?


Exercise: 


Problem: 


Suppose that the second researcher conducted his survey by choosing
40 patients he knew. What sampling method would that researcher
have used? What concerns would you have about this data set, based
upon the data collection method?


Sampling and Data: Homework (edited: Teegarden) 
Exercise: 


Problem: For each item below: 


i. Identify the type of data (quantitative - discrete, quantitative -
continuous, or qualitative) that would be used to describe a
response.

ii. Give an example of the data.


a. Number of tickets sold to a concert
b. Amount of body fat
c. Favorite baseball team
d. Time in line to buy groceries
e. Number of students enrolled at Evergreen Valley College
f. Most-watched television show
g. Brand of toothpaste
h. Distance to the closest movie theatre
i. Age of executives in Fortune 500 companies
j. Number of competing computer spreadsheet software packages


Solution: 


a. quantitative - discrete
b. quantitative - continuous
c. qualitative
d. quantitative - continuous
e. quantitative - discrete
f. qualitative
g. qualitative
h. quantitative - continuous
i. quantitative - continuous
j. quantitative - discrete


Exercise: 


Problem: 


Fifty part-time students were asked how many courses they were 
taking this term. The (incomplete) results are shown below: 


# of                      Relative     Cumulative Relative
Courses      Frequency    Frequency    Frequency
1            30           0.6
2            15
3


Part-time Student Course Loads


a. Fill in the blanks in the table above.
b. What percent of students take exactly two courses?
c. What percent of students take one or two courses?

Exercise: 
Problem: 
Sixty adults with gum disease were asked the number of times per
week they used to floss before their diagnoses. The (incomplete)
results are shown below:


# Flossing                 Relative     Cumulative Relative
per Week      Frequency    Frequency    Freq.
0             27           0.45
1             18
3                                       0.93
6             3            0.05
7             1            0.02


Flossing Frequency for Adults with Gum Disease


a. Fill in the blanks in the table above.
b. What percent of adults flossed six times per week?
c. What percent flossed at most three times per week?


Solution: 


b. 5%
c. 93%


Exercise: 


Problem: 


A fitness center is interested in the average amount of time a client 
exercises in the center each week. Define the following in terms of the 
study. Give examples where appropriate. 


a. Population
b. Sample
c. Parameter
d. Statistic
e. Variable
f. Data


Exercise: 


Problem: 


Ski resorts are interested in the average age that children take their first 
ski and snowboard lessons. They need this information to optimally 
plan their ski classes. Define the following in terms of the study. Give 
examples where appropriate. 


a. Population
b. Sample
c. Parameter
d. Statistic
e. Variable
f. Data


Solution:


a. Children who take ski and snowboard lessons
b. A group of these children
c. The population average
d. The sample average
e. X = the age of one child who takes the first ski or snowboard
lesson
f. A value for X, such as 3, 7, etc.


Exercise: 
Problem: 
A cardiologist is interested in the average recovery period for her
patients who have had heart attacks. Define the following in terms of
the study. Give examples where appropriate.


a. Population
b. Sample
c. Parameter
d. Statistic
e. Variable
f. Data


Exercise: 


Problem: 


Insurance companies are interested in the average health costs each 
year for their clients, so that they can determine the costs of health 
insurance. Define the following in terms of the study. Give examples 
where appropriate. 


a. Population
b. Sample
c. Parameter
d. Statistic
e. Variable
f. Data


Solution:


a. The clients of the insurance companies
b. A group of the clients
c. The average health cost per year of all the clients
d. The average health cost per year of the clients in the sample
e. X = the health costs for one year of one client
f. A value for X, such as 34, 9, 82, etc.


Exercise: 


Problem: 


A politician is interested in the proportion of voters in his district who 
think he is doing a good job. Define the following in terms of the 
study. Give examples where appropriate. 


a. Population
b. Sample
c. Parameter
d. Statistic
e. Variable
f. Data


Exercise: 


Problem: 


A marriage counselor is interested in the proportion of the clients she
counsels who stay married. Define the following in terms of the study.
Give examples where appropriate.


a. Population
b. Sample
c. Parameter
d. Statistic
e. Variable
f. Data


Solution:


a. All the clients of the counselor
b. A group of the clients
c. The proportion of all her clients who stay married
d. The proportion of the sample who stay married
e. X = the number of couples who stay married
f. yes, no


Exercise: 


Problem: 


Political pollsters may be interested in the proportion of people who 
will vote for a particular cause. Define the following in terms of the 
study. Give examples where appropriate. 


a. Population
b. Sample
c. Parameter
d. Statistic
e. Variable
f. Data


Exercise: 


Problem: 


A marketing company is interested in the proportion of people who 
will buy a particular product. Define the following in terms of the 
study. Give examples where appropriate. 


a. Population
b. Sample
c. Parameter
d. Statistic
e. Variable
f. Data


Solution:


a. All people (maybe in a certain geographic area, such as the
United States)
b. A group of the people
c. The proportion of all people who will buy the product
d. The proportion of the sample who will buy the product
e. X = the number of people who will buy it
f. buy, not buy


Exercise: 


Problem: 


Airline companies are interested in the consistency of the number of 
babies on each flight, so that they have adequate safety equipment. 
Suppose an airline conducts a survey. Over Thanksgiving weekend, it 
surveys 6 flights from Boston to Salt Lake City to determine the 
number of babies on the flights. It determines the amount of safety 
equipment needed by the result of that study. 


a. Using complete sentences, list three things wrong with the way
the survey was conducted.

b. Using complete sentences, list three ways that you would
improve the survey if it were to be repeated.


Exercise: 
Problem: 
Suppose you want to determine the average number of students per
Statistics class in your state. Describe a possible sampling method in 3
to 5 complete sentences. Be detailed.


Exercise:
Problem:
Suppose you want to determine the average number of cans of soda
drunk each month by persons in their twenties. Describe a possible
sampling method in 3 to 5 complete sentences. Be detailed.


Exercise: 


Problem: 


726 distance learning students at Long Beach City College in the 
2004-2005 academic year were surveyed and asked the reasons they 
took a distance learning class. (Source: Amit Schitai, Director of 
Instructional Technology and Distance Learning, LBCC). The results 
of this survey are listed in the table below. 


Convenience                                              87.6%
Unable to come to campus                                 85.1%
Taking on-campus courses in addition to my DL course     71.7%
Instructor has a good reputation                         69.1%
To fulfill requirements for transfer                     60.8%
To fulfill requirements for Associate Degree             53.6%
Thought DE would be more varied and interesting          53.2%
I like computer technology                               52.1%
Had success with previous DL course                      52.0%
On-campus sections were full                             42.1%
To fulfill requirements for vocational certification     27.1%
Because of disability                                    20.5%


Reasons for Taking LBCC Distance Learning Courses 


Assume that the survey allowed students to choose from the responses 
listed in the table above. 


a. Why can the percents add up to over 100%?

b. Does that necessarily imply a mistake in the report?

c. How do you think the question was worded to get responses that
totaled over 100%?

d. How might the question be worded to get responses that totaled
100%?


Exercise: 
Problem: 


Nineteen immigrants to the U.S. were asked how many years, to the
nearest year, they have lived in the U.S. The data are as follows:


2, 5, 7, 2, 2, 10, 20, 15, 0, 7, 0, 20, 5, 12, 15, 12, 4, 5, 10


The following table was produced: 


Data   Frequency   Relative Frequency   Cumulative Relative Frequency
0      2           2/19                 0.1053
2      3           3/19                 0.2632
4      1           1/19                 0.3158
5      3           3/19                 0.1579
7      2           2/19                 0.5789
10     2           2/19                 0.6842
12     2           2/19                 0.7895
15     1           1/19                 0.8421
20     1           1/19                 1.0000


Frequency of Immigrant Survey Responses 


a. Fix the errors on the table. Also, explain how someone might
have arrived at the incorrect number(s).

b. Explain what is wrong with this statement: “47 percent of the
people surveyed have lived in the U.S. for 5 years.”

c. Fix the statement above to make it correct.

d. What fraction of the people surveyed have lived in the U.S. 5 or
7 years?

e. What fraction of the people surveyed have lived in the U.S. at
most 12 years?

f. What fraction of the people surveyed have lived in the U.S.
fewer than 12 years?

g. What fraction of the people surveyed have lived in the U.S. from
5 to 20 years, inclusive?


Exercise: 


Problem: 


A “random survey” was conducted of 3274 people of the 
“microprocessor generation” (people born since 1971, the year the 
microprocessor was invented). It was reported that 48% of those 
individuals surveyed stated that if they had $2000 to spend, they would 
use it for computer equipment. Also, 66% of those surveyed 
considered themselves relatively savvy computer users. (Source: San 
Jose Mercury News) 


a. Do you consider the sample size large enough for a study of this
type? Why or why not?

b. Based on your “gut feeling,” do you believe the percents
accurately reflect the U.S. population for those individuals born
since 1971? If not, do you think the percents of the population are
actually higher or lower than the sample statistics? Why?


Additional information: The survey was reported by Intel Corporation
of individuals who visited the Los Angeles Convention Center to see
the Smithsonian Institution's road show called “America’s Smithsonian.”


c. With this additional information, do you feel that all
demographic and ethnic groups were equally represented at the
event? Why or why not?

d. With the additional information, comment on how accurately
you think the sample statistics reflect the population parameters.


Exercise: 


Problem: 


a. List some practical difficulties involved in getting accurate
results from a telephone survey.

b. List some practical difficulties involved in getting accurate
results from a mailed survey.

c. With your classmates, brainstorm some ways to overcome these
problems if you needed to conduct a phone or mail survey.


Try these multiple choice questions 


The next four questions refer to the following: A Lake Tahoe 
Community College instructor is interested in the average number of days 
Lake Tahoe Community College math students are absent from class during 
a quarter. 

Exercise: 


Problem: What is the population she is interested in? 


A. All Lake Tahoe Community College students
B. All Lake Tahoe Community College English students
C. All Lake Tahoe Community College students in her classes
D. All Lake Tahoe Community College math students


Solution: 


D 


Exercise: 


Problem: Consider the following: 


X = number of days a Lake Tahoe Community College math 
student is absent 


In this case, X is an example of a: 


A. Variable
B. Population
C. Statistic
D. Data


Solution: 


A 
Exercise: 
Problem: 
The instructor takes her sample by gathering data on 5 randomly
selected students from each Lake Tahoe Community College math
class. The type of sampling she used is


A. Cluster sampling
B. Stratified sampling
C. Simple random sampling
D. Convenience sampling


Solution: 


B 
Exercise: 


Problem: 


The instructor’s sample produces an average number of days absent of 
3.5 days. This value is an example of a 


A. Parameter
B. Data
C. Statistic
D. Variable


Solution:


C


The next two questions refer to the following relative frequency table on
hurricanes that made direct hits on the U.S. between 1851 and 2004.


Hurricanes are given a strength category rating based on the minimum wind
speed generated by the storm. (http://www.nhc.noaa.gov/gifs/tables5.gif)


Category   Number of Direct Hits   Relative Frequency   Cumulative Frequency
1          109                     0.3993               0.3993
2          72                      0.2637               0.6630
3          71                      0.2601               ____
4          18                      ____                 0.9890
5          3                       0.0110               1.0000
           Total = 273


Frequency of Hurricane Direct Hits


Exercise:


Problem:


What is the relative frequency of direct hits that were category 4
hurricanes?


A. 0.0768
B. 0.0659
C. 0.2601
D. Not enough information to calculate


Solution: 


B 
Exercise: 
Problem: 


What is the relative frequency of direct hits that were AT MOST a
category 3 storm?


A. 0.3480
B. 0.9231
C. 0.2601
D. 0.3370


Solution: 


B 


The next three questions refer to the following: A study was done to 
determine the age, number of times per week and the duration (amount of 
time) of resident use of a local park in San Jose. The first house in the 
neighborhood around the park was selected randomly and then every 8th 
house in the neighborhood around the park was interviewed. 

Exercise: 


Problem: "Number of times per week" is what type of data?


A. qualitative
B. quantitative - discrete
C. quantitative - continuous


Solution: 


B 


Exercise: 


Problem: The sampling method was: 


A. simple random
B. systematic
C. stratified
D. cluster


Solution: 


B 


Exercise: 


Problem: "Duration (amount of time)" is what type of data?


A. qualitative
B. quantitative - discrete
C. quantitative - continuous


Solution: 


C 


Candy Lab 

This lab allows students to practice and demonstrate techniques used to
generate systematic samples. Students will have the opportunity to create
relative frequency tables and interpret results based on different data
groupings. The labs have been modified to include Minitab usage. This lab
replaces Chapter 1, Lab 1 from Collaborative Statistics by Dean and Illowsky.


Candy Lab 


Name: 


Student Learning Outcomes 


• The student will construct Relative Frequency Tables.

• The student will interpret results and their differences from different
data groupings.

• The student will illustrate the data using pie charts and bar graphs.


General Directions 


In class, answer the initial questions on the lines provided and complete the 
tables. The write-up questions should be answered in paragraph form and 
typed. The graphs must be generated in Minitab and may then be copied 
into the write-up or attached at the end of the lab. To save paper, please 
copy the graphs and paste them into Word so that more than one graph can 
be printed per page. 


Data Collection 


Before you open your candy, make your first prediction about the
distribution of the colors of this candy. This prediction will be made with no
knowledge about the color distribution except your personal experience.
(There are no wrong answers.)


1. How many candies do you think there are in your package? 


2. Which color do you think will occur the most often? 


3. Which color do you think will occur the least? 


Open your candy and sort them by color. DO NOT EAT ANY AT THIS 
TIME! 


1. How many candies do you have? 

2. Which color occurred the most often? 

3. Which color occurred the least? 

4. Based on your individual sample, what do you think the actual color
distribution is for this candy? Predict a % for each color and explain your
reasoning. You can only base your prediction on what you see in your
sample, not what you know about the candy.


YOU MAY NOW EAT YOUR CANDY


Summarizing the Data 


Complete the three relative frequency tables below using your personal 
data, your group data and the class data. 


Individual Bag Frequency Table 


Color Frequency Relative Frequency 


Group Color Frequency Table 


Color Frequency Relative Frequency 


Class Frequency Table 


Color Frequency Relative Frequency 


Graphs 


1. Illustrate the data for each set (individual, group, class) by inputting the 
colors in C1, individual frequencies in C2, group frequencies in C3, and 
class frequencies in C4 and create a pie and bar chart for each. 


2. You may copy the graphs into the write-up or attach them to your Lab. If
you choose not to include the graphs in the body of the write-up, please
copy and paste them into a Word document so as not to print out 6 pages of
graphs.


Write-up


Answer the following questions in paragraph form. To compare/contrast the
data you should refer to at least three key values, either where the data are
similar or where they are very different. Be sure to use the information from
your charts and graphs to justify your statements.


1. Using the tables and graphs, compare/contrast the results for your data 
and the group's combined data. Use at least three examples to justify your 
answer. 


2. Using the tables and graphs, compare/contrast the results for your data 
and the class' combined data. Use at least three examples to justify your 
answer. 


3. Using the tables and graphs, compare/contrast the results for the group's 
combined data and the class’ combined data. Use at least three examples 
to justify your answer. 


4. Which of the three data sets would you use to best predict the distribution 
of colors for this candy? Why? What would you predict the actual 
distribution for this candy may be? 


Descriptive Statistics 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Display data graphically and interpret graphs: stemplots, histograms
and boxplots.

• Recognize, describe, and calculate the measures of location of data:
quartiles and percentiles.

• Recognize, describe, and calculate the measures of the center of data:
mean, median, and mode.

• Recognize, describe, and calculate the measures of the spread of data:
variance, standard deviation, and range.


Introduction 


Once you have collected data, what will you do with it? Data can be 
described and presented in many different formats. For example, suppose 
you are interested in buying a house in a particular area. You may have no 
clue about the house prices, so you might ask your real estate agent to give 
you a sample data set of prices. Looking at all the prices in the sample often 
is overwhelming. A better way might be to look at the median price and the 
variation of prices. The median and variation are just two ways that you 
will learn to describe data. Your agent might also provide you with a graph 
of the data. 


In this chapter, you will study numerical and graphical ways to describe and 
display your data. This area of statistics is called "Descriptive Statistics". 
You will learn to calculate, and even more importantly, to interpret these 
measurements and graphs. 


Displaying Data 
This module provides a brief introduction to the ways graphs and charts
can be used to provide visual representations of data.


A statistical graph is a tool that helps you learn about the shape or
distribution of a sample. The graph can be a more effective way of
presenting data than a mass of numbers because we can see where data
cluster and where there are only a few data values. Newspapers and the
Internet use graphs to show trends and to enable readers to compare facts
and figures quickly.


Statisticians often graph data first to get a picture of the data. Then, more 
formal tools may be applied. 


Some of the types of graphs that are used to summarize and organize data 
are the dot plot, the bar chart, the histogram, the stem-and-leaf plot, the 
frequency polygon (a type of broken line graph), pie charts, and the 
boxplot. In this chapter, we will briefly look at stem-and-leaf plots, line 
graphs and bar graphs. Our emphasis will be on histograms and boxplots. 


Stem and Leaf Graphs (Stemplots) 
This module introduces the use of stem-and-leaf graphs (stemplots), line 
graphs and bar graphs for describing a set of data visually. 


One simple graph, the stem-and-leaf graph or stem plot, comes from the
field of exploratory data analysis. It is a good choice when the data sets are
small. To create the plot, divide each observation of data into a stem and a
leaf. The leaf consists of a final significant digit. For example, 23 has stem
2 and leaf 3. Four hundred thirty-two (432) has stem 43 and leaf 2. Five
thousand four hundred thirty-two (5,432) has stem 543 and leaf 2. The
decimal 9.3 has stem 9 and leaf 3. Write the stems in a vertical line from
smallest to largest. Draw a vertical line to the right of the stems. Then
write the leaves in increasing order next to their corresponding stem.


Example: 

For Susan Dean's spring pre-calculus class, scores for the first exam were
as follows (smallest to largest):
33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73, 74, 78, 80,
83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100


Stem   Leaf
3      3
4      2 9 9
5      3 5 5
6      1 3 7 8 8 9 9
7      2 3 4 8
8      0 3 8 8 8
9      0 2 4 4 4 4 6
10     0


Stem-and-Leaf Diagram 


The stem plot shows that most scores fell in the 60s, 70s, 80s, and 90s.
Eight out of the 31 scores, or approximately 26% of the scores, were in the
90s or 100, a fairly high number of As.
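
The stem-and-leaf construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stemplot` is hypothetical, the data are whole numbers, and the leaf is taken to be the final digit.

```python
from collections import defaultdict


def stemplot(values):
    """Return stem-and-leaf rows for whole-number data (leaf = last digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 68 -> stem 6, leaf 8
        leaves[stem].append(leaf)
    rows = []
    # List every stem from smallest to largest so that gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row_leaves}".rstrip())
    return rows


# Exam scores from the example (31 values).
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

for row in stemplot(scores):
    print(row)
```

Because every stem between the smallest and the largest is listed, a value far from the rest of the data shows up as a run of empty stems, which is one way an outlier becomes visible.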


The stem plot is a quick way to graph and gives an exact picture of the data. 
You want to look for an overall pattern and any outliers. An outlier is an 
observation of data that does not fit the rest of the data. It is sometimes 
called an extreme value. When you graph an outlier, it will appear not to fit 
the pattern of the graph. Some outliers are due to mistakes (for example, 
writing down 50 instead of 500) while others may indicate that something 
unusual is happening. It takes some background information to explain 
outliers. In the example above, there were no outliers. 


Example: 

Create a stem plot using the data:
1.1, 1.5, 2.3, 2.5, 2.7, 3.2, 3.3, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5, 4.5, 4.7,
4.8, 5.5, 5.6, 6.5, 6.7, 12.3
The data are the distance (in kilometers) from a home to the nearest
supermarket.

Exercise: 


Problem: 


1. Are there any values that might possibly be outliers? 
2. Do the data seem to have any concentration of values? 


Note: The leaves are to the right of the decimal.


Solution: 


The value 12.3 may be an outlier. Values appear to concentrate at 3 
and 4 kilometers. 


Stem   Leaf
1      1 5
2      3 5 7
3      2 3 3 5 8
4      0 2 5 5 7 8
5      5 6
6      5 7
7
8
9
10
11
12     3


Another type of graph that is useful for specific data values is a line graph. 
In the particular line graph shown in the example, the x-axis consists of 
data values and the y-axis consists of frequency points. The frequency 
points are connected. 


Example: 
In a survey, 40 mothers were asked how many times per week a teenager
must be reminded to do his/her chores. The results are shown in the table
and the line graph.


Number of times teenager is reminded   Frequency
0                                      2
1                                      5
2                                      8
3                                      14
4                                      7
5                                      4


[Line graph: Number of Times Teenager is Reminded (x-axis) versus Frequency (y-axis)]


Bar graphs consist of bars that are separated from each other. The bars can 
be rectangles or they can be rectangular boxes and they can be vertical or 
horizontal. 


The bar graph shown in Example 4 has age groups represented on the x- 
axis and proportions on the y-axis. 


Example: 

By the end of 2011, in the United States, Facebook had over 146 million
users. The table shows three age groups, the number of users in each age
group and the proportion (%) of users in each age group. Source:
http://www.kenburbary.com/2011/03/facebook-demographics-revisited-2011-statistics-2/


Age groups   Number of Facebook users   Proportion (%) of Facebook users
13 - 25      65,082,280                 45%
26 - 44      53,300,200                 36%
45 - 64      27,885,100                 19%


[Bar graph: Ages (x-axis) versus Proportion (%) (y-axis)]

Example: 

The columns in the table below contain the race/ethnicity of U.S. Public 
Schools: High School Class of 2011, percentages for the Advanced 
Placement Examinee Population for that class and percentages for the 
Overall Student Population. The 3-dimensional graph shows the 
Race/Ethnicity of U.S. Public Schools (qualitative data) on the x-axis and 
Advanced Placement Examinee Population percentages on the y-axis. 
(Source: http://www.collegeboard.com and Source: 
http://apreport.collegeboard.org/goals-and-findings/promoting-equity) 


Race/Ethnicity                                  AP Examinee Population   Overall Student Population
1 = Asian, Asian American or Pacific Islander   10.3%                    5.7%
2 = Black or African American                   9.0%                     14.7%
3 = Hispanic or Latino                          17.0%                    17.6%
4 = American Indian or Alaska Native            0.6%                     1.1%
5 = White                                       57.1%                    59.2%
6 = Not reported/other                          6.0%                     1.7%


[Bar graph: Ethnicity/Race vs. Percent of AP Examinees]


Go to Outcomes of Education Figure 22 for an example of a bar graph that 
shows unemployment rates of persons 25 years and older for 2009. 


Note: This book contains instructions for constructing a histogram and a
box plot for the TI-83+ and TI-84 calculators. You can find additional
instructions for using these calculators on the Texas Instruments (TI)
website.


Glossary 


Outlier 
An observation that does not fit the rest of the data. 


Histograms 

This module provides an overview of Descriptive Statistics: Histogram as a 
part of Collaborative Statistics collection (col10522) by Barbara Illowsky 
and Susan Dean. 


For most of the work you do in this book, you will use a histogram to 
display the data. One advantage of a histogram is that it can readily display 
large data sets. A rule of thumb is to use a histogram when the data set 
consists of 100 values or more. 


A histogram consists of contiguous boxes. It has both a horizontal axis and 
a vertical axis. The horizontal axis is labeled with what the data represents 
(for instance, distance from your home to school). The vertical axis is 
labeled either frequency or relative frequency. The graph will have the
same shape with either label. The histogram (like the stemplot) can give 
you the shape of the data, the center, and the spread of the data. (The next 
section tells you how to calculate the center and the spread.) 


The relative frequency is equal to the frequency for an observed value of 
the data divided by the total number of data values in the sample. (In the 
chapter on Sampling and Data, we defined frequency as the number of 
times an answer occurs.) If: 


• f = frequency

• n = total number of data values (or the sum of the individual
frequencies), and

• RF = relative frequency,


then:
Equation:


$$RF = \frac{f}{n}$$


For example, if 3 students in Mr. Ahab's English class of 40 students
received from 90% to 100%, then,


$$f = 3,\quad n = 40,\quad \text{and}\quad RF = \frac{f}{n} = \frac{3}{40} = 0.075$$


Seven and a half percent of the students received 90% to 100%. Ninety
percent to 100% are quantitative measures.
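
As a quick check of the arithmetic, the formula RF = f/n can be written as a one-line function. This is a minimal sketch; the name `relative_frequency` is a hypothetical helper, and `Fraction` is used only so the exact ratio 3/40 is visible alongside its decimal form.

```python
from fractions import Fraction


def relative_frequency(f, n):
    """RF = f / n, returned as an exact fraction."""
    return Fraction(f, n)


# Mr. Ahab's class: 3 of 40 students scored from 90% to 100%.
rf = relative_frequency(3, 40)
print(rf)         # exact ratio
print(float(rf))  # decimal form
```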


To construct a histogram, first decide how many bars or intervals, also 
called classes, represent the data. Many histograms consist of from 5 to 15 
bars or classes for clarity. Choose a starting point for the first interval to be 
less than the smallest data value. A convenient starting point is a lower 
value carried out to one more decimal place than the value with the most 
decimal places. For example, if the value with the most decimal places is 
6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 - 
0.05 = 6.05). We say that 6.05 has more precision. If the value with the 
most decimal places is 2.23 and the lowest value is 1.5, a convenient 
starting point is 1.495 (1.5 - 0.005 = 1.495). If the value with the most
decimal places is 3.234 and the lowest value is 1.0, a convenient starting
point is 0.9995 (1.0 - 0.0005 = 0.9995). If all the data happen to be integers
and the smallest value is 2, then a convenient starting point is 1.5 (2 - 0.5 = 
1.5). Also, when the starting point and other boundaries are carried to one 
additional decimal place, no data value will fall on a boundary. 


Example: 

The following data are the heights (in inches to the nearest half inch) of 
100 male semiprofessional soccer players. The heights are continuous data 
since height is measured. 

60 60.5 61 61 61.5

63.5 63.5 63.5

64 64 64 64 64 64 64 64.5 64.5 64.5 64.5 64.5 64.5 64.5 64.5

66 66 66 66 66 66 66 66 66 66 66.5 66.5 66.5 66.5 66.5 66.5 66.5 66.5
66.5 66.5 66.5 67 67 67 67 67 67 67 67 67 67 67 67 67.5 67.5 67.5 67.5
67.5 67.5 67.5

68 68 69 69 69 69 69 69 69 69 69 69 69.5 69.5 69.5 69.5 69.5

70 70 70 70 70 70 70.5 70.5 70.5 71 71 71

72 72 72 72.5 72.5 73 73.5

74


The smallest data value is 60. Since the data with the most decimal places 
has one decimal (for instance, 61.5), we want our starting point to have two 
decimal places. Since the numbers 0.5, 0.05, 0.005, etc. are convenient 
numbers, use 0.05 and subtract it from 60, the smallest value, for the 
convenient starting point. 

60 - 0.05 = 59.95 which is more precise than, say, 61.5 by one decimal 
place. The starting point is, then, 59.95. 

The largest value is 74. 74 + 0.05 = 74.05 is the ending value.

Next, calculate the width of each bar or class interval. To calculate this 
width, subtract the starting point from the ending value and divide by the 
number of bars (you must choose the number of bars you desire). Suppose 
you choose 8 bars. 

Equation:


$$\frac{74.05 - 59.95}{8} = 1.76$$


Note: We will round up to 2 and make each bar or class interval 2 units 
wide. Rounding up to 2 is one way to prevent a value from falling on a 
boundary. Rounding to the next number is necessary even if it goes 
against the standard rules of rounding. For this example, using 1.76 as the 
width would also work. 


The boundaries are: 


• 59.95
• 59.95 + 2 = 61.95
• 61.95 + 2 = 63.95
• 63.95 + 2 = 65.95
• 65.95 + 2 = 67.95
• 67.95 + 2 = 69.95
• 69.95 + 2 = 71.95
• 71.95 + 2 = 73.95
• 73.95 + 2 = 75.95


The heights 60 through 61.5 inches are in the interval 59.95 - 61.95. The 
heights that are 63.5 are in the interval 61.95 - 63.95. The heights that are 
64 through 64.5 are in the interval 63.95 - 65.95. The heights 66 through 
67.5 are in the interval 65.95 - 67.95. The heights 68 through 69.5 are in 
the interval 67.95 - 69.95. The heights 70 through 71 are in the interval 
69.95 - 71.95. The heights 72 through 73.5 are in the interval 71.95 - 
73.95. The height 74 is in the interval 73.95 - 75.95. 
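
The boundary calculation in this example can be reproduced with a short Python sketch. It is an illustration only: `class_boundaries` and `find_class` are hypothetical helper names, and `Decimal` is an implementation choice made here so that values such as 59.95 are stored exactly and no height can land on a boundary through rounding error.

```python
import math
from decimal import Decimal


def class_boundaries(smallest, largest, bars, offset):
    """Return (raw_width, boundaries) for a histogram.

    Numeric arguments are passed as strings so Decimal keeps them exact.
    """
    start = Decimal(smallest) - Decimal(offset)  # 60 - 0.05 = 59.95
    end = Decimal(largest) + Decimal(offset)     # 74 + 0.05 = 74.05
    raw_width = (end - start) / bars
    width = math.ceil(raw_width)                 # round up, as in the note above
    boundaries = [start + i * width for i in range(bars + 1)]
    return raw_width, boundaries


def find_class(value, boundaries):
    """Return the (lower, upper) boundaries of the class containing value."""
    v = Decimal(value)
    for lower, upper in zip(boundaries, boundaries[1:]):
        if lower < v < upper:
            return lower, upper
    raise ValueError(f"{value} is outside the histogram or on a boundary")


raw_width, boundaries = class_boundaries("60", "74", bars=8, offset="0.05")
print(raw_width)
print([str(b) for b in boundaries])
print(find_class("63.5", boundaries))
```

With 8 bars the raw width is about 1.76, which rounds up to 2, and the nine boundaries run from 59.95 to 75.95 as listed above.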

The following histogram displays the heights on the x-axis and relative 
frequency on the y-axis. 


[Histogram: Heights (x-axis, class boundaries 59.95, 61.95, 63.95, 65.95, 67.95, 69.95, 71.95, 73.95, 75.95) versus Relative Frequency (y-axis)]


Example: 
The following data are the number of books bought by 50 part-time college
students at ABC College. The number of books is discrete data since books
are counted.

1 1 1 1 1 1 1 1 1 1 1

2 2 2 2 2 2 2 2 2 2

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

4 4 4 4 4 4

5 5 5 5 5

6 6

Eleven students buy 1 book. Ten students buy 2 books. Sixteen students 
buy 3 books. Six students buy 4 books. Five students buy 5 books. Two 
students buy 6 books. 

Because the data are integers, subtract 0.5 from 1, the smallest data value 
and add 0.5 to 6, the largest data value. Then the starting point is 0.5 and 
the ending value is 6.5. 

Exercise: 


Problem: 


Next, calculate the width of each bar or class interval. If the data are
discrete and there are not too many different values, a width that
places the data values in the middle of the bar or class interval is the
most convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6
and the starting point is 0.5, a width of one places the 1 in the middle
of the interval from 0.5 to 1.5, the 2 in the middle of the interval from
1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, the 4 in
the middle of the interval from _______ to _______, the 5 in the
middle of the interval from _______ to _______, and the _______ in
the middle of the interval from _______ to _______.


Solution: 


• 3.5 to 4.5
• 4.5 to 5.5
• 6
• 5.5 to 6.5


Calculate the number of bars as follows:
Equation:


$$\frac{6.5 - 0.5}{\text{bars}} = 1$$


where 1 is the width of a bar. Therefore, bars = 6.
The following histogram displays the number of books on the x-axis and
the frequency on the y-axis.


[Histogram: Number of Books (x-axis) versus Frequency (y-axis)]


Using the TI-83, 83+, 84, 84+ Calculator Instructions 

Go to the Appendix (14:Appendix) in the menu on the left. There are 
calculator instructions for entering data and for creating a customized 
histogram. Create the histogram for Example 2. 


• Press Y=. Press CLEAR to clear out any equations.

• Press STAT 1:EDIT. If L1 has data in it, arrow up into the name L1,
press CLEAR and arrow down. If necessary, do the same for L2.

• Into L1, enter 1, 2, 3, 4, 5, 6

• Into L2, enter 11, 10, 16, 6, 5, 2

• Press WINDOW. Make Xmin = .5, Xmax = 6.5, Xscl = (6.5 - .5)/6,
Ymin = -1, Ymax = 20, Yscl = 1, Xres = 1

• Press 2nd Y=. Start by pressing 4:PlotsOff ENTER.

• Press 2nd Y=. Press 1:Plot1. Press ENTER. Arrow down to TYPE.
Arrow to the 3rd picture (histogram). Press ENTER.

• Arrow down to Xlist: Enter L1 (2nd 1). Arrow down to Freq. Enter L2
(2nd 2).

• Press GRAPH

• Use the TRACE key and the arrow keys to examine the histogram.


Optional Collaborative Exercise 


Count the money (bills and change) in your pocket or purse. Your instructor 
will record the amounts. As a class, construct a histogram displaying the 
data. Discuss how many intervals you think is appropriate. You may want to 
experiment with the number of intervals. Discuss, also, the shape of the 
histogram. 


Record the data, in dollars (for example, 1.25 dollars). 


Construct a histogram. 


Glossary 


Frequency 
The number of times a value of the data occurs. 


Relative Frequency 
The ratio of the number of times a value of the data occurs in the set of 
all outcomes to the number of all outcomes. 


Box Plots 


Box plots or box-whisker plots give a good graphical image of the 
concentration of the data. They also show how far from most of the data the 
extreme values are. The box plot is constructed from five values: the 
smallest value, the first quartile, the median, the third quartile, and the 
largest value. The median, the first quartile, and the third quartile will be 
discussed here, and then again in the section on measuring data in this 
chapter. We use these values to compare how close other data values are to 
them. 


The median, a number, is a way of measuring the "center" of the data. You 
can think of the median as the "middle value," although it does not actually 
have to be one of the observed values. It is a number that separates ordered 
data into halves. Half the values are the same number or smaller than the 
median and half the values are the same number or larger. For example, 
consider the following data: 


1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1
Ordered from smallest to largest:
1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5


The median is between the 7th value, 6.8, and the 8th value 7.2. To find the 
median, add the two values together and divide by 2. 
Equation:


$$\frac{6.8 + 7.2}{2} = 7$$


The median is 7. Half of the values are smaller than 7 and half of the values 
are larger than 7. 


Quartiles are numbers that separate the data into quarters. Quartiles may or
may not be part of the data. To find the quartiles, first find the median or
second quartile. The first quartile is the middle value of the lower half of
the data and the third quartile is the middle value of the upper half of the
data. To get the idea, consider the same data set shown above:


1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5


The median or second quartile is 7. The lower half of the data is 1, 1, 2, 2,
4, 6, 6.8. The middle value of the lower half is 2.


1, 1, 2, 2, 4, 6, 6.8


The number 2, which is part of the data, is the first quartile. One-fourth of 
the values are the same or less than 2 and three-fourths of the values are 
more than 2. 


The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of 
the upper half is 9. 


7.2, 8, 8.3, 9, 10, 10, 11.5


The number 9, which is part of the data, is the third quartile. Three-fourths 
of the values are less than 9 and one-fourth of the values are more than 9. 
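
The "median of each half" procedure described above can be sketched in Python as follows. The function name `quartiles` is a hypothetical helper, and the handling of an odd number of values (leaving the median out of both halves) is an assumption made for this sketch, since the example here has an even count. Calculators and software packages may use slightly different quartile conventions, so their results can differ a little from this method.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Requires at least two values. For an odd count, the middle value
    is excluded from both halves (an assumption of this sketch).
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the median position
    upper = data[-half:]   # values above the median position
    return median(lower), median(data), median(upper)


data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
q1, med, q3 = quartiles(data)
print(q1, med, q3)
```

For the 14 values in the text this gives a first quartile of 2, a median of 7, and a third quartile of 9.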


To construct a box plot, use a horizontal number line and a rectangular box. 
The smallest and largest data values label the endpoints of the axis. The first 
quartile marks one end of the box and the third quartile marks the other end 
of the box. The middle fifty percent of the data fall inside the box. The 
"whiskers" extend from the ends of the box to the smallest and largest data 
values. The box plot gives a good quick picture of the data. 


Note: You may encounter box and whisker plots that have dots marking 
outlier values. In those cases, the whiskers are not extending to the 
minimum and maximum values. 


Consider the following data: 


1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5


The first quartile is 2, the median is 7, and the third quartile is 9. The 
smallest value is 1 and the largest value is 11.5. The box plot is constructed 
as follows (see calculator instructions in the back of this book or on the TI 
web site): 


[Box plot on a number line from 1 to 11.5: box from 2 to 9, median at 7, whiskers to 1 and 11.5]


The two whiskers extend from the first quartile to the smallest value and 
from the third quartile to the largest value. The median is shown with a 
dashed line. 


Example: 

The following data are the heights of 40 students in a statistics class.

59 60 61 62 62 63 63 64 64 64 65 65 65 65 65 65 65 65 65 66 66 67 67 68
68 69 70 70 70 70 70 71 71 72 72 73 74 74 75 77

Construct a box plot: 

Using the TI-83, 83+, 84, 84+ Calculator 


• Enter data into the list editor (Press STAT 1:EDIT). If you need to
clear the list, arrow up to the name L1, press CLEAR, arrow down.

• Put the data values in list L1.

• Press STAT and arrow to CALC. Press 1:1-VarStats. Enter L1.

• Press ENTER

• Use the down and up arrow keys to scroll.


• Smallest value = 59
• Largest value = 77
• Q1: First quartile = 64.5
• Q2: Second quartile or median = 66
• Q3: Third quartile = 70


Using the TI-83, 83+, 84, 84+ to Construct the Box Plot 
Go to 14:Appendix for Notes for the TI-83, 83+, 84, 84+ Calculator. To 
create the box plot: 


• Press Y=. If there are any equations, press CLEAR to clear them.

• Press 2nd Y=.

• Press 4:PlotsOff. Press ENTER

• Press 2nd Y=

• Press 1:Plot1. Press ENTER.

• Arrow down and then use the right arrow key to go to the 5th picture
which is the box plot. Press ENTER.

• Arrow down to Xlist: Press 2nd 1 for L1

• Arrow down to Freq: Press ALPHA. Press 1.

• Press ZOOM. Press 9:ZoomStat.

• Press TRACE and use the arrow keys to examine the box plot.


[Box plot: smallest value 59, first quartile 64.5, median 66, third quartile 70, largest value 77]


a. Each quarter has 25% of the data.

b. The spreads of the four quarters are 64.5 - 59 = 5.5 (first quarter), 66
- 64.5 = 1.5 (second quarter), 70 - 66 = 4 (3rd quarter), and 77 - 70 =
7 (fourth quarter). So, the second quarter has the smallest spread and
the fourth quarter has the largest spread.

c. Interquartile Range: IQR = Q3 - Q1 = 70 - 64.5 = 5.5.

d. The interval 59 through 65 has more than 25% of the data so it has
more data in it than the interval 66 through 70 which has 25% of the
data.

e. The middle 50% (middle half) of the data has a range of 5.5 inches.
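
The quarter spreads and the interquartile range follow directly from the five values of the box plot, as this short Python sketch shows. The function name `quarter_spreads` and the variable names are hypothetical; the five numbers are the ones reported for the 40 heights.

```python
def quarter_spreads(minimum, q1, med, q3, maximum):
    """Return the width of each quarter of the data, in order."""
    cuts = [minimum, q1, med, q3, maximum]
    return [upper - lower for lower, upper in zip(cuts, cuts[1:])]


# Five-number summary of the 40 heights: min, Q1, median, Q3, max.
summary = (59, 64.5, 66, 70, 77)

spreads = quarter_spreads(*summary)
iqr = summary[3] - summary[1]  # IQR = Q3 - Q1

labels = ["first", "second", "third", "fourth"]
narrowest = labels[spreads.index(min(spreads))]
widest = labels[spreads.index(max(spreads))]

print(spreads)
print(iqr, narrowest, widest)
```

Each quarter holds the same share of the data, so a narrow quarter means the values there are packed closely together and a wide quarter means they are spread out.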


For some sets of data, some of the largest value, smallest value, first 
quartile, median, and third quartile may be the same. For instance, you 
might have a data set in which the median and the third quartile are the 
same. In this case, the diagram would not have a dotted line inside the box 
displaying the median. The right side of the box would display both the 
third quartile and the median. For example, if the smallest value and the 
first quartile were both 1, the median and the third quartile were both 5, 
and the largest value was 7, the box plot would look as follows: 


Example: 

Test scores for a college statistics class held during the day are: 

99 56 78 55.5 32 90 80 81 56 59 45 77 84.5 84 70 72 68 32 79 90 

Test scores for a college statistics class held during the evening are: 

98 78 68 83 81 89 88 76 65 45 98 90 80 84.5 85 79 78 98 90 79 81 25.5 
Exercise: 


Problem: 


• What are the smallest and largest data values for each data set?

• What are the median, the first quartile, and the third quartile for
  each data set?

• Create a box plot for each set of data.

• Which box plot has the widest spread for the middle 50% of the
  data (the data between the first and third quartiles)? What does
  this mean for that set of data in comparison to the other set of
  data?

• For each data set, what percent of the data is between the
  smallest value and the first quartile? (Answer: 25%) the first
  quartile and the median? (Answer: 25%) the median and the third
  quartile? the third quartile and the largest value? What percent of
  the data is between the first quartile and the largest value?
  (Answer: 75%)


Solution: 
First Data Set 


• Xmin = 32
• Q1 = 56
• M = 74.5
• Q3 = 82.5
• Xmax = 99


Second Data Set

• Xmin = 25.5
• Q1 = 78
• M = 81
• Q3 = 89
• Xmax = 98


The first data set (the top box plot) has the widest spread for the middle
50% of the data. IQR = Q3 - Q1 is 82.5 - 56 = 26.5 for the first data
set and 89 - 78 = 11 for the second data set. So, the first set of data has
its middle 50% of scores more spread out.

25% of the data is between M and Q3 and 25% is between Q3 and Xmax. 
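
The five-number summaries in this solution can be checked with a short script. The sketch below assumes the quartile rule used in this module: the median splits the ordered data into a lower and an upper half, and Q1 and Q3 are the medians of those halves. Statistical software often uses other quartile rules, so its output may differ slightly. The function name five_number_summary is illustrative only.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]        # values below the median position
    upper = xs[n - half:]    # values above the median position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

day = [99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59,
       45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90]
evening = [98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98,
           90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5]

day_summary = five_number_summary(day)
evening_summary = five_number_summary(evening)

# IQR = Q3 - Q1 for each class
day_iqr = day_summary[3] - day_summary[1]
evening_iqr = evening_summary[3] - evening_summary[1]

print("Day:    ", day_summary, "IQR =", day_iqr)
print("Evening:", evening_summary, "IQR =", evening_iqr)
```

The day class should come out with the larger IQR, matching the conclusion that its middle 50% of scores is more spread out.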


Glossary 


Median 
A number that separates ordered data into halves. Half the values are 
the same number or smaller than the median and half the values are the 
same number or larger than the median. The median may or may not 
be part of the data. 


Quartiles 
The numbers that separate the data into quarters. Quartiles may or may 
not be part of the data. The second quartile is the median of the data. 


Measures of the Location of the Data 

Descriptive Statistics: Measuring the Location of Data explains percentiles and 
quartiles and is part of the collection col10555 written by Barbara Illowsky and 
Susan Dean. Roberta Bloom contributed the section "Interpreting Percentiles, 
Quartile and the Median." 


The common measures of location are quartiles and percentiles (%iles). Quartiles
are special percentiles. The first quartile, Q1, is the same as the 25th percentile
(25th %ile) and the third quartile, Q3, is the same as the 75th percentile (75th
%ile). The median, M, is called both the second quartile and the 50th percentile
(50th %ile).


Note: Quartiles are given special attention in the Box Plots module in this chapter. 


To calculate quartiles and percentiles, the data must be ordered from smallest to 
largest. Recall that quartiles divide ordered data into quarters. Percentiles divide 
ordered data into hundredths. To score in the 90th percentile of an exam does not 
mean, necessarily, that you received 90% on a test. It means that 90% of test 
scores are the same or less than your score and 10% of the test scores are the same 
or greater than your test score. 


Percentiles are useful for comparing values. For this reason, universities and 
colleges use percentiles extensively. 


Percentiles are mostly used with very large populations. Therefore, if you were to 
say that 90% of the test scores are less (and not the same or less) than your score, 
it would be acceptable because removing one particular data value is not 
significant. 


The interquartile range is a number that indicates the spread of the middle half
or the middle 50% of the data. It is the difference between the third quartile (Q3)
and the first quartile (Q1).

Equation:


IQR = Q3 - Q1


The IQR can help to determine potential outliers. A value is suspected to be a
potential outlier if it is more than (1.5)(IQR) below the first quartile or more
than (1.5)(IQR) above the third quartile. Potential outliers always need further
investigation.


Example: 
Exercise: 


Problem: 

For the following 13 real estate prices, calculate the IQR and determine if 
any prices are outliers. Prices are in dollars. (Source: San Jose Mercury 
News) 


389,950 230,500 158,000 479,000 639,000 114,950 5,500,000 387,000 
659,000 529,000 575,000 488,800 1,095,000 


Solution: 


Order the data from smallest to largest. 


114,950 158,000 230,500 387,000 389,950 479,000 488,800 529,000
575,000 639,000 659,000 1,095,000 5,500,000


M = 488,800

Q1 = (230,500 + 387,000)/2 = 308,750

Q3 = (639,000 + 659,000)/2 = 649,000

IQR = 649,000 - 308,750 = 340,250

(1.5)(IQR) = (1.5)(340,250) = 510,375

Q1 - (1.5)(IQR) = 308,750 - 510,375 = -201,625
Q3 + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375


No house price is less than -201,625. However, 5,500,000 is more than
1,159,375. Therefore, 5,500,000 is a potential outlier.
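
The same outlier check can be sketched in code. This is a minimal illustration, assuming the median-of-halves quartile rule from this module; the names quartiles, lower_fence, and upper_fence are chosen here for illustration and do not come from the text.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    xs = sorted(data)
    half = len(xs) // 2
    return median(xs[:half]), median(xs), median(xs[len(xs) - half:])

prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

q1, m, q3 = quartiles(prices)
iqr = q3 - q1

# A value is a potential outlier if it lies more than 1.5 * IQR
# below Q1 or more than 1.5 * IQR above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
potential_outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print("Q1 =", q1, " M =", m, " Q3 =", q3, " IQR =", iqr)
print("Fences:", lower_fence, upper_fence)
print("Potential outliers:", potential_outliers)
```

A flagged value is only a candidate; as noted above, potential outliers always need further investigation.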


Example: 
Exercise: 


Problem: 
For the two data sets in the test scores example, find the following: 


• a. The interquartile range. Compare the two interquartile ranges.

• b. Any outliers in either set.

• c. The 30th percentile and the 80th percentile for each set. How much
  data falls below the 30th percentile? Above the 80th percentile?


Solution: 


For the IQRs, see the answer to the test scores example. The first data set has 
the larger IQR, so the scores between Q3 and Q1 (middle 50%) for the first 
data set are more spread out and not clustered about the median. 


First Data Set

• (3/2)(IQR) = (3/2)(26.5) = 39.75
• Xmax - Q3 = 99 - 82.5 = 16.5
• Q1 - Xmin = 56 - 32 = 24

(3/2)(IQR) = 39.75 is larger than 16.5 and larger than 24, so the first set
has no outliers.


Second Data Set

• (3/2)(IQR) = (3/2)(11) = 16.5
• Xmax - Q3 = 98 - 89 = 9
• Q1 - Xmin = 78 - 25.5 = 52.5

(3/2)(IQR) = 16.5 is larger than 9 but smaller than 52.5, so for the
second set 45 and 25.5 are outliers.


To find the percentiles, create a frequency, relative frequency, and
cumulative relative frequency chart (see the Sampling and Data chapter).
Get the percentiles from that chart.

First Data Set

• 30th %ile (between the 6th and 7th values) = (56 + 59)/2 = 57.5
• 80th %ile (between the 16th and 17th values) = (84 + 84.5)/2 = 84.25

Second Data Set

• 30th %ile (7th value) = 78
• 80th %ile (18th value) = 90


30% of the data falls below the 30th %ile, and 20% falls above the 80th 
%ile. 


Example: 

Finding Quartiles and Percentiles Using a Table 

Fifty statistics students were asked how much sleep they get per school night 
(rounded to the nearest hour). The results were (student data): 


AMOUNT OF SLEEP                               CUMULATIVE
PER SCHOOL NIGHT               RELATIVE       RELATIVE
(HOURS)            FREQUENCY   FREQUENCY      FREQUENCY

4                  2           0.04           0.04
5                  5           0.10           0.14
6                  7           0.14           0.28
7                  12          0.24           0.52
8                  14          0.28           0.80
9                  7           0.14           0.94
10                 3           0.06           1.00


Find the 28th percentile: Notice the 0.28 in the "cumulative relative frequency" 
column. 28% of 50 data values = 14. There are 14 values less than the 28th %ile. 
They include the two 4s, the five 5s, and the seven 6s. The 28th %ile is between 
the last 6 and the first 7. The 28th %ile is 6.5. 

Find the median: Look again at the "cumulative relative frequency" column and
find 0.52. The median is the 50th %ile or the second quartile. 50% of 50 = 25.
There are 25 values less than the median. They include the two 4s, the five 5s, the
seven 6s, and eleven of the 7s. The median or 50th %ile is between the 25th (7)
and 26th (7) values. The median is 7.

Find the third quartile: The third quartile is the same as the 75th percentile. You
can "eyeball" this answer. If you look at the "cumulative relative frequency"
column, you find 0.52 and 0.80. When you have all the 4s, 5s, 6s and 7s, you
have 52% of the data. When you include all the 8s, you have 80% of the data.
The 75th %ile, then, must be an 8. Another way to look at the problem is to
find 75% of 50 (= 37.5) and round up to 38. The third quartile, Q3, is the 38th
value, which is an 8. You can check this answer by counting the values. (There are
37 values below the third quartile and 12 values above.)
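
The counting procedure in these three calculations can be written as a small function. The sketch below encodes the rule as it is applied in the worked examples: compute p% of n; if the result is a whole number k, average the kth and (k+1)th ordered values; otherwise round up and take that value. The function name percentile_from_table and the dictionary layout are illustrative choices, and other textbooks and software packages use different percentile rules.

```python
# Hours of sleep -> number of students (from the table above)
sleep_hours = {4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3}

def percentile_from_table(freq, p):
    """Return the p-th percentile (integer p, 0 < p < 100) of a frequency table."""
    # Expand the table into the full ordered list of data values.
    values = [v for v in sorted(freq) for _ in range(freq[v])]
    n = len(values)
    # k is the whole part of p% of n; remainder is nonzero when p% of n
    # is not a whole number.
    k, remainder = divmod(p * n, 100)
    if remainder == 0:
        # Whole number: the percentile lies between the kth and (k+1)th values.
        return (values[k - 1] + values[k]) / 2
    # Otherwise round up: take the (k+1)th value, which has index k.
    return values[k]

print("28th percentile:", percentile_from_table(sleep_hours, 28))
print("Median:", percentile_from_table(sleep_hours, 50))
print("Third quartile:", percentile_from_table(sleep_hours, 75))
```

Working with whole-number arithmetic (divmod) avoids rounding surprises when deciding whether p% of n is a whole number.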


Example: 


Exercise: 


Problem: Using the table: 


1. Find the 80th percentile. 

2. Find the 90th percentile. 

3. Find the first quartile. 

4. What is another name for the first quartile? 


Solution: 


1. (8 + 9)/2 = 8.5.
   Look where cum. rel. freq. = 0.80. 80% of the data is 8 or less. The 80th
   %ile is between the last 8 and the first 9.
2. 9
3. 6
4. First Quartile = 25th %ile


Collaborative Classroom Exercise: Your instructor or a member of the class will 
ask everyone in class how many sweaters they own. Answer the following 
questions. 


1. How many students were surveyed? 

2. What kind of sampling did you do? 

3. Construct a table of the data. 

4. Construct 2 different histograms. For each, starting value = _____ ending
value = _____

5. Use the table to find the median, first quartile, and third quartile. 

6. Construct a box plot. 

7. Use the table to find the following: 


o The 10th percentile 
o The 70th percentile 
o The percent of students who own fewer than 4 sweaters


Interpreting Percentiles, Quartiles, and Median 


A percentile indicates the relative standing of a data value when data are sorted 
into numerical order, from smallest to largest. p% of data values are less than or 
equal to the pth percentile. For example, 15% of data values are less than or equal 
to the 15th percentile. 


• Low percentiles always correspond to lower data values.
• High percentiles always correspond to higher data values.


A percentile may or may not correspond to a value judgment about whether it is 
"good" or "bad". The interpretation of whether a certain percentile is good or bad 
depends on the context of the situation to which the data applies. In some 
situations, a low percentile would be considered "good"; in other contexts a high
percentile might be considered "good". In many situations, there is no value 
judgment that applies. 


Understanding how to properly interpret percentiles is important not only when 
describing data, but is also important in later chapters of this textbook when 
calculating probabilities. 


Guideline: 


When writing the interpretation of a percentile in the context of the given data, the 
sentence should contain the following information: 


• information about the context of the situation being considered,

• the data value (value of the variable) that represents the percentile,

• the percent of individuals or items with data values below the percentile.

• Additionally, you may also choose to state the percent of individuals or items
  with data values above the percentile.


Example: 
On a timed math test, the first quartile for times for finishing the exam was 35 
minutes. Interpret the first quartile in the context of this situation. 


• 25% of students finished the exam in 35 minutes or less.

• 75% of students finished the exam in 35 minutes or more.

• A low percentile could be considered good, as finishing more quickly on a
  timed exam is desirable. (If you take too long, you might not be able to
  finish.)


Example: 
On a 20 question math test, the 70th percentile for number of correct answers was 
16. Interpret the 70th percentile in the context of this situation. 


• 70% of students answered 16 or fewer questions correctly.

• 30% of students answered 16 or more questions correctly.

• Note: A high percentile could be considered good, as answering more
  questions correctly is desirable.


Example: 

At a certain community college, it was found that the 30th percentile of credit 
units that students are enrolled for is 7 units. Interpret the 30th percentile in the 
context of this situation. 


• 30% of students are enrolled in 7 or fewer credit units.

• 70% of students are enrolled in 7 or more credit units.

• In this example, there is no "good" or "bad" value judgment associated with
  a higher or lower percentile. Students attend community college for varied
  reasons and needs, and their course load varies according to their needs.


Do the following Practice Problems for Interpreting Percentiles 
Exercise: 


Problem: 


• a. For runners in a race, a low time means a faster run. The winners in a
  race have the shortest running times. Is it more desirable to have a finish
  time with a high or a low percentile when running a race?

• b. The 20th percentile of run times in a particular race is 5.2 minutes.
  Write a sentence interpreting the 20th percentile in the context of the
  situation.

• c. A bicyclist in the 90th percentile of a bicycle race between two towns
  completed the race in 1 hour and 12 minutes. Is he among the fastest or
  slowest cyclists in the race? Write a sentence interpreting the 90th
  percentile in the context of the situation.


Solution: 


• a. For runners in a race it is more desirable to have a low percentile for
  finish time. A low percentile means a short time, which is faster.

• b. INTERPRETATION: 20% of runners finished the race in 5.2 minutes
  or less. 80% of runners finished the race in 5.2 minutes or longer.

• c. He is among the slowest cyclists (90% of cyclists were faster than
  him). INTERPRETATION: 90% of cyclists had a finish time of 1 hour,
  12 minutes or less. Only 10% of cyclists had a finish time of 1 hour, 12
  minutes or longer.


Exercise: 


Problem: 


• a. For runners in a race, a higher speed means a faster run. Is it more
  desirable to have a speed with a high or a low percentile when running a
  race?

• b. The 40th percentile of speeds in a particular race is 7.5 miles per hour.
  Write a sentence interpreting the 40th percentile in the context of the
  situation.


Solution: 


• a. For runners in a race it is more desirable to have a high percentile for
  speed. A high percentile means a higher speed, which is faster.

• b. INTERPRETATION: 40% of runners ran at speeds of 7.5 miles per
  hour or less (slower). 60% of runners ran at speeds of 7.5 miles per hour
  or more (faster).


Exercise: 


Problem: 


On an exam, would it be more desirable to earn a grade with a high or low 
percentile? Explain. 


Solution: 
On an exam you would prefer a high percentile; higher percentiles 
correspond to higher grades on the exam. 

Exercise: 
Problem: 
Mina is waiting in line at the Department of Motor Vehicles (DMV). Her wait 
time of 32 minutes is the 85th percentile of wait times. Is that good or bad? 


Write a sentence interpreting the 85th percentile in the context of this 
situation. 


Solution: 


When waiting in line at the DMV, the 85th percentile would be a long wait
time compared to the other people waiting. 85% of people had shorter wait
times than Mina did. In this context, Mina would prefer a wait time
corresponding to a lower percentile. INTERPRETATION: 85% of people at
the DMV waited 32 minutes or less. 15% of people at the DMV waited 32
minutes or longer.


Exercise: 
Problem: 
In a survey collecting data about the salaries earned by recent college 


graduates, Li found that her salary was in the 78th percentile. Should Li be 
pleased or upset by this result? Explain. 


Solution: 


Li should be pleased. Her salary is relatively high compared to other recent 
college grads. 78% of recent college graduates earn less than Li does. 22% of 
recent college graduates earn more than Li does. 


Exercise: 


Problem: 


In a study collecting data about the repair costs of damage to automobiles in a 
certain type of crash tests, a certain model of car had $1700 in damage and 
was in the 90th percentile. Should the manufacturer and/or a consumer be 
pleased or upset by this result? Explain. Write a sentence that interprets the 
90th percentile in the context of this problem. 


Solution: 


The manufacturer and the consumer would be upset. This is a large repair 
cost for the damages, compared to the other cars in the sample. 
INTERPRETATION: 90% of the crash tested cars had damage repair costs of 
$1700 or less; only 10% had damage repair costs of $1700 or more. 


Exercise: 


Problem: 


The University of California has two criteria used to set admission
standards for freshmen to be admitted to a college in the UC system:

• a. Students' GPAs and scores on standardized tests (SATs and ACTs) are
  entered into a formula that calculates an "admissions index" score. The
  admissions index score is used to set eligibility standards intended to
  meet the goal of admitting the top 12% of high school students in the
  state. In this context, what percentile does the top 12% represent?

• b. Students whose GPAs are at or above the 96th percentile of all
  students at their high school are eligible (called eligible in the local
  context), even if they are not in the top 12% of all students in the state.
  What percent of students from each high school are "eligible in the local
  context"?


Solution: 
• a. The top 12% of students are those who are at or above the 88th
  percentile of admissions index scores.

• b. The top 4% of students' GPAs are at or above the 96th percentile,
  making the top 4% of students "eligible in the local context".


Exercise: 


Problem: 


Suppose that you are buying a house. You and your realtor have determined 
that the most expensive house you can afford is the 34th percentile. The 34th 
percentile of housing prices is $240,000 in the town you want to move to. In 
this town, can you afford 34% of the houses or 66% of the houses? 


Solution: 


You can afford 34% of houses. 66% of the houses are too expensive for your 
budget. INTERPRETATION: 34% of houses cost $240,000 or less. 66% of 
houses cost $240,000 or more. 


**With contributions from Roberta Bloom 


Glossary 


Interquartile Range (IQR)
The distance between the third quartile (Q3) and the first quartile (Q1). IQR
= Q3 - Q1.


Outlier 
An observation that does not fit the rest of the data. 


Percentile 
A number that divides ordered data into hundredths. 


Example: 
Let a data set contain 200 ordered observations starting with
{2.3, 2.7, 2.8, 2.9, 2.9, 3.0, ...}. Then the first percentile is
(2.7 + 2.8)/2 = 2.75, because 1% of the data is to the left of this point on
the number line and 99% of the data is on its right. The second percentile is
(2.9 + 2.9)/2 = 2.9. Percentiles may or may not be part of the data. In this
example, the first percentile is not in the data, but the second percentile is.
The median of the data is the second quartile and the 50th percentile. The
first and third quartiles are the 25th and the 75th percentiles, respectively.


Quartiles 
The numbers that separate the data into quarters. Quartiles may or may not be 
part of the data. The second quartile is the median of the data. 


Measures of the Center of the Data 
This chapter discusses measuring descriptive statistical information using the
center of the data.


The "center" of a data set is also a way of describing location. The two most 
widely used measures of the "center" of the data are the mean (average) and the 
median. To calculate the mean weight of 50 people, add the 50 weights 
together and divide by 50. To find the median weight of the 50 people, order 
the data and find the number that splits the data into two equal parts (previously 
discussed under box plots in this chapter). The median is generally a better 
measure of the center when there are extreme values or outliers because it is not 
affected by the precise numerical values of the outliers. The mean is the most 
common measure of the center. 


Note: The words "mean" and "average" are often used interchangeably. The
substitution of one word for the other is common practice. The technical term
is "arithmetic mean" and "average" is technically a center location. However,
in practice among non-statisticians, "average" is commonly accepted for
"arithmetic mean."


The mean can also be calculated by multiplying each distinct value by its
frequency and then dividing the sum by the total number of data values. The
letter used to represent the sample mean is an x with a bar over it (pronounced
"x bar"): x̄.


The Greek letter μ (pronounced "mew") represents the population mean. One
of the requirements for the sample mean to be a good estimate of the population
mean is for the sample taken to be truly random.


To see that both ways of calculating the mean are the same, consider the 
sample: 


1 1 1 2 2 3 4 4 4 4 4
Equation:


x̄ = (1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4)/11 = 2.7


Equation:


x̄ = (3 × 1 + 2 × 2 + 1 × 3 + 5 × 4)/11 = 2.7


In the second calculation for the sample mean, the frequencies are 3, 2, 1, and
5.
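
As a quick check that the two calculations agree, here is a minimal sketch. It uses Python's Counter to build the frequency table; the variable names are illustrative.

```python
from collections import Counter

sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

# Method 1: add every value and divide by the number of values.
mean_direct = sum(sample) / len(sample)

# Method 2: multiply each distinct value by its frequency, add the
# products, and divide by the total number of values.
freq = Counter(sample)   # counts: value 1 -> 3, 2 -> 2, 3 -> 1, 4 -> 5
mean_weighted = sum(value * count for value, count in freq.items()) / sum(freq.values())

print(round(mean_direct, 1), round(mean_weighted, 1))
```

Both methods add up the same total (30) and divide by the same count (11), so they must give the same mean.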

You can quickly find the location of the median by using the expression (n + 1)/2.
The letter n is the total number of data values in the sample. If n is an odd
number, the median is the middle value of the ordered data (ordered smallest to
largest). If n is an even number, the median is equal to the two middle values
added together and divided by 2 after the data has been ordered. For example, if
the total number of data values is 97, then (n + 1)/2 = (97 + 1)/2 = 49. The median
is the 49th value in the ordered data. If the total number of data values is 100, then
(n + 1)/2 = (100 + 1)/2 = 50.5. The median occurs midway between the 50th and 51st
values. The location of the median and the value of the median are not the
same. The upper case letter M is often used to represent the median. The next
example illustrates the location of the median and the value of the median.


Example: 
Exercise: 


Problem: 


AIDS data indicating the number of months an AIDS patient lives after 
taking a new antibody drug are as follows (smallest to largest): 


3 4 8 8 10 11 12 13 14 15 15 16 16 17 17 18 21 22 22 24 24 25 26 26 27 27 29 29 31 32
33 33 34 34 35 37 40 44 44 47


Calculate the mean and the median. 


Solution: 


The calculation for the mean is:


x̄ = [3 + 4 + (8)(2) + 10 + 11 + 12 + 13 + 14 + (15)(2) + (16)(2) + ... + 35 + 37 + 40 + (44)(2) + 47]/40 = 23.6


To find the median, M, first use the formula for the location. The location
is:


(n + 1)/2 = (40 + 1)/2 = 20.5


Starting at the smallest value, the median is located between the 20th and
21st values (the two 24s):


3 4 8 8 10 11 12 13 14 15 15 16 16 17 17 18 21 22 22 24 24
25 26 26 27 27 29 29 31 32 33 33 34 34 35 37 40 44 44 47


M = (24 + 24)/2 = 24


The median is 24.
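
The distinction between the location of the median and its value can be made concrete in code. This is a sketch using the 40 survival times from the problem; the variable names are illustrative.

```python
months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22, 22, 24,
          24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35, 37, 40, 44, 44, 47]

months.sort()                 # the data must be ordered before finding the median
n = len(months)
mean = sum(months) / n

# Location of the median (a position in the ordered list, not a data value).
location = (n + 1) / 2

if n % 2 == 1:
    # Odd n: the median is the single middle value.
    median_value = months[n // 2]
else:
    # Even n: average the two middle values (positions n/2 and n/2 + 1).
    median_value = (months[n // 2 - 1] + months[n // 2]) / 2

print("n =", n)
print("mean =", round(mean, 1))
print("median location =", location)
print("median value =", median_value)
```

Python lists are indexed from 0, so the 20th and 21st values are months[19] and months[20].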


Using the TI-83,83+,84, 84+ Calculators 
Calculator Instructions are located in the menu item 14:Appendix (Notes for 
the TI-83, 83+, 84, 84+ Calculators). 


• Enter data into the list editor. Press STAT 1:EDIT.
• Put the data values in list L1.
• Press STAT and arrow to CALC. Press 1:1-VarStats. Press 2nd 1 for L1
  and ENTER.
• Press the down and up arrow keys to scroll.


x̄ = 23.6, M = 24


Example: 
Exercise: 


Problem: 
Suppose that, in a small town of 50 people, one person earns $5,000,000 


per year and the other 49 each earn $30,000. Which is the better measure 
of the "center," the mean or the median? 


Solution: 


x̄ = (5,000,000 + 49 × 30,000)/50 = 129,400
M = 30,000


(There are 49 people who earn $30,000 and one person who earns 
$5,000,000.) 


The median is a better measure of the "center" than the mean because 49 
of the values are 30,000 and one is 5,000,000. The 5,000,000 is an outlier. 
The 30,000 gives us a better sense of the middle of the data. 
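
The effect of the single extreme income can be seen directly with Python's standard statistics module. This is a small sketch; the list construction is just one convenient way to write out the 50 incomes.

```python
from statistics import mean, median

# One person earning $5,000,000 and 49 people earning $30,000 each.
incomes = [5_000_000] + [30_000] * 49

town_mean = mean(incomes)
town_median = median(incomes)

print("mean   =", town_mean)
print("median =", town_median)

# Dropping the outlier barely changes the median but changes the mean a lot.
without_outlier = [x for x in incomes if x != 5_000_000]
print("mean without outlier   =", mean(without_outlier))
print("median without outlier =", median(without_outlier))
```

The mean is pulled far above what 49 of the 50 residents earn, while the median stays at 30,000.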


Another measure of the center is the mode. The mode is the most frequent
value. If a data set has two values that occur the same number of times, and
more often than any other value, then the set is bimodal.


Example: 

Statistics exam scores for 20 students are as follows:
50 53 59 59 63 63 72 72 72 72 72 76 78 81 83 84 84 84 90 93
Exercise:


Problem: Find the mode.
Solution: 


The most frequent score is 72, which occurs five times. Mode = 72. 


Example: 
Five real estate exam scores are 430, 430, 480, 480, 495. The data set is 
bimodal because the scores 430 and 480 each occur twice. 
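
Both mode examples can be checked with statistics.multimode from the Python standard library (available in Python 3.8 and later), which returns every value tied for the highest frequency. The exam scores below are transcribed from the example above; the list of colors is hypothetical data added only to show that the mode also works for qualitative values.

```python
from statistics import multimode

exam_scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72,
               72, 76, 78, 81, 83, 84, 84, 84, 90, 93]
real_estate_scores = [430, 430, 480, 480, 495]

# multimode returns a list of all most-frequent values.
exam_modes = multimode(exam_scores)
real_estate_modes = multimode(real_estate_scores)

# Hypothetical qualitative data: the mode is the most common category.
favorite_colors = ["red", "blue", "blue", "green"]
color_modes = multimode(favorite_colors)

print("Exam score mode(s):", exam_modes)
print("Real estate exam mode(s):", real_estate_modes)
print("Bimodal?", len(real_estate_modes) == 2)
print("Color mode(s):", color_modes)
```

A result list of length two corresponds to a bimodal data set.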


When is the mode the best measure of the "center"? Consider a weight loss 
program that advertises a mean weight loss of six pounds the first week of the 
program. The mode might indicate that most people lose two pounds the first 
week, making the program less appealing. 


Note: The mode can be calculated for qualitative data as well as for
quantitative data. 


Statistical software will easily calculate the mean, the median, and the mode. 
Some graphing calculators can also make these calculations. In the real world, 
people make these calculations using software. 


The Law of Large Numbers and the Mean 


The Law of Large Numbers says that if you take samples of larger and larger
size from any population, then the mean x̄ of the sample is very likely to get
closer and closer to μ. This is discussed in more detail in The Central Limit
Theorem.
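
A simulation can illustrate the idea. The sketch below is an illustration only, not part of the original text: it assumes the population is the outcomes of a fair six-sided die (population mean 3.5) and uses an arbitrary seed so the run is repeatable. Exact sample means depend on the random draws, so no particular output is claimed.

```python
import random

random.seed(12345)        # arbitrary seed, chosen only for repeatability
population_mean = 3.5     # mean of a fair six-sided die: (1+2+3+4+5+6)/6

def sample_mean(n):
    """Roll a fair die n times and return the mean of the rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# Larger samples should tend to give means closer to 3.5.
for n in (10, 100, 1_000, 10_000, 100_000):
    xbar = sample_mean(n)
    print(f"n = {n:>7}  sample mean = {xbar:.3f}  "
          f"distance from 3.5 = {abs(xbar - population_mean):.3f}")
```

Any single small sample can still land close to 3.5 by chance; the law describes what is very likely as the sample size grows.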


Note: The formula for the mean is located in the Summary of Formulas section.


Sampling Distributions and Statistic of a Sampling Distribution 


You can think of a sampling distribution as a relative frequency distribution 
with a great many samples. (See Sampling and Data for a review of relative 
frequency). Suppose thirty randomly selected students were asked the number 
of movies they watched the previous week. The results are in the relative 
frequency table shown below. 


# of movies    Relative Frequency

0              5/30
1              15/30
2              6/30
3              4/30
4              1/30


If you let the number of samples get very large (say, 300 million or more), 
the relative frequency table becomes a relative frequency distribution. 


A statistic is a number calculated from a sample. Statistic examples include the
mean, the median and the mode as well as others. The sample mean x̄ is an
example of a statistic which estimates the population mean μ.


Glossary 


Mean


A number that measures the central tendency. A common name for mean
is "average." The term "mean" is a shortened form of "arithmetic mean." By
definition, the mean for a sample (denoted by x̄) is
x̄ = (Sum of all values in the sample)/(Number of values in the sample),
and the mean for a population (denoted by μ) is
μ = (Sum of all values in the population)/(Number of values in the population).


Median 


A number that separates ordered data into halves. Half the values are the 
same number or smaller than the median and half the values are the same 
number or larger than the median. The median may or may not be part of 
the data. 


Mode 


The value that appears most frequently in a set of data. 


Skewness and the Mean, Median, and Mode 
Consider the following data set:
4 5 6 6 6 7 7 7 7 7 7 8 8 8 9 10


This data set produces the histogram shown below. Each interval has width
one and each value is located in the middle of an interval.


[Figure: histogram with intervals centered at 4, 5, 6, 7, 8, 9, and 10]


The histogram displays a symmetrical distribution of data. A distribution is 
symmetrical if a vertical line can be drawn at some point in the histogram 
such that the shape to the left and the right of the vertical line are mirror 
images of each other. The mean, the median, and the mode are each 7 for 
these data. In a perfectly symmetrical distribution, the mean and the 
median are the same. This example has one mode (unimodal) and the 
mode is the same as the mean and median. In a symmetrical distribution 
that has two modes (bimodal), the two modes would be different from the 
mean and median. 


The histogram for the data:
4 5 6 6 6 7 7 7 7 8


is not symmetrical. The right-hand side seems "chopped off" compared to
the left side. A distribution of this type is called skewed to the left because it is
pulled out to the left.


[Figure: histogram with intervals centered at 4, 5, 6, 7, and 8]


The mean is 6.3, the median is 6.5, and the mode is 7. Notice that the 
mean is less than the median and they are both less than the mode. The 
mean and the median both reflect the skewing but the mean more so. 


The histogram for the data:
6 7 7 7 7 8 8 8 9 10


is also not symmetrical. It is skewed to the right.


[Figure: histogram with intervals centered at 6, 7, 8, 9, and 10]


The mean is 7.7, the median is 7.5, and the mode is 7. Of the three statistics, 
the mean is the largest, while the mode is the smallest. Again, the mean 
reflects the skewing the most. 


To summarize, generally if the distribution of data is skewed to the left, the 
mean is less than the median, which is often less than the mode. If the 
distribution of data is skewed to the right, the mode is often less than the 
median, which is less than the mean. 
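
The three data sets in this section can be compared side by side. The sketch below uses Python's standard statistics module, with the data values as transcribed from the examples above; the dictionary keys are descriptive labels only.

```python
from statistics import mean, median, mode

datasets = {
    "symmetric":    [4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10],
    "skewed left":  [4, 5, 6, 6, 6, 7, 7, 7, 7, 8],
    "skewed right": [6, 7, 7, 7, 7, 8, 8, 8, 9, 10],
}

# For each data set, record (mean, median, mode).
summary = {name: (mean(xs), median(xs), mode(xs)) for name, xs in datasets.items()}

for name, (m, med, mo) in summary.items():
    print(f"{name:<13} mean = {m}  median = {med}  mode = {mo}")
```

In the left-skewed set the mean falls below the median, and in the right-skewed set it rises above it, which is the pattern summarized above.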


Skewness and symmetry become important when we discuss probability 
distributions in later chapters. 


Descriptive Statistics: Measuring the Spread of the Data 

Descriptive Statistics: Measuring the Spread of Data explains standard deviation as a measure of variation in data 
and is part of the collection col10555 written by Barbara Illowsky and Susan Dean. Roberta Bloom made 
contributions that helped to clarify the standard deviation and the variance. 


An important characteristic of any set of data is the variation in the data. In some data sets, the data values are 
concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. 
The most common measure of variation, or spread, is the standard deviation. 


The standard deviation is a number that measures how far data values are from their mean. 
The standard deviation 


e provides a numerical measure of the overall amount of variation in a data set 
e can be used to determine whether a particular data value is close to or far from the mean 


The standard deviation provides a measure of the overall variation in a data set 

The standard deviation is always positive or 0. The standard deviation is small when the data are all concentrated 
close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are 
more spread out from the mean, exhibiting more variation. 


Suppose that we are studying waiting times at the checkout line for customers at supermarket A and supermarket 
B; the average wait time at both markets is 5 minutes. At market A, the standard deviation for the waiting time is 2 
minutes; at market B the standard deviation for the waiting time is 4 minutes. 


Because market B has a higher standard deviation, we know that there is more variation in the waiting times at 
market B. Overall, wait times at market B are more spread out from the average; wait times at market A are more 
concentrated near the average. 


The standard deviation can be used to determine whether a data value is close to or far from the mean. 
Suppose that Rosa and Binh both shop at Market A. Rosa waits for 7 minutes and Binh waits for 1 minute at the 
checkout counter. At market A, the mean wait time is 5 minutes and the standard deviation is 2 minutes. The 
standard deviation can be used to determine whether a data value is close to or far from the mean. 


Rosa waits for 7 minutes: 


• 7 is 2 minutes longer than the average of 5; 2 minutes is equal to one standard deviation.
• Rosa's wait time of 7 minutes is 2 minutes longer than the average of 5 minutes.
• Rosa's wait time of 7 minutes is one standard deviation above the average of 5 minutes.


Binh waits for 1 minute.


• 1 is 4 minutes less than the average of 5; 4 minutes is equal to two standard deviations.

• Binh's wait time of 1 minute is 4 minutes less than the average of 5 minutes.

• Binh's wait time of 1 minute is two standard deviations below the average of 5 minutes.

• A data value that is two standard deviations from the average is just on the borderline for what many
  statisticians would consider to be far from the average. Considering data to be far from the mean if it is more
  than 2 standard deviations away is more of an approximate "rule of thumb" than a rigid rule. In general, the
  shape of the distribution of the data affects how much of the data is further away than 2 standard deviations.
  (We will learn more about this in later chapters.)


The number line may help you understand standard deviation. If we were to put 5 and 7 on a number line, 7 is to 
the right of 5. We say, then, that 7 is one standard deviation to the right of 5 because 
5 + (1)(2) =7. 


If 1 were also part of the data set, then 1 is two standard deviations to the left of 5 because 
5 + (-2)(2) =1. 


[Figure: number line from 0 to 7.]


• In general, a value = mean + (#ofSTDEVs)(standard deviation)
• where #ofSTDEVs = the number of standard deviations
• 7 is one standard deviation more than the mean of 5 because: 7 = 5 + (1)(2)
• 1 is two standard deviations less than the mean of 5 because: 1 = 5 + (−2)(2)
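
The relationship between a value, the mean, and the standard deviation can be checked numerically. The following is a minimal Python sketch using the Market A figures from the text (mean 5 minutes, standard deviation 2 minutes). The function names are illustrative and are not part of the text.

```python
def value_from_stdevs(mean, num_stdevs, stdev):
    """Return the value lying num_stdevs standard deviations from the mean."""
    return mean + num_stdevs * stdev


def stdevs_from_mean(value, mean, stdev):
    """Return how many standard deviations a value lies from the mean."""
    return (value - mean) / stdev


mean_wait, sd_wait = 5, 2  # Market A wait times, in minutes

print(value_from_stdevs(mean_wait, 1, sd_wait))   # Rosa's wait: 7
print(value_from_stdevs(mean_wait, -2, sd_wait))  # Binh's wait: 1
print(stdevs_from_mean(7, mean_wait, sd_wait))    # 1.0
print(stdevs_from_mean(1, mean_wait, sd_wait))    # -2.0
```

A negative result means the value lies below (to the left of) the mean.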


The equation value = mean + (#ofSTDEVs)(standard deviation) can be expressed for a sample and for a 
population: 


• Sample: $x = \bar{x} + (\text{\#ofSTDEVs})(s)$
• Population: $x = \mu + (\text{\#ofSTDEVs})(\sigma)$


The lower case letter s represents the sample standard deviation and the Greek letter $\sigma$ (sigma, lower case)
represents the population standard deviation.


The symbol $\bar{x}$ is the sample mean and the Greek symbol $\mu$ is the population mean.


Calculating the Standard Deviation 

If x is a number, then the difference "x - mean" is called its deviation. In a data set, there are as many deviations 
as there are items in the data set. The deviations are used to calculate the standard deviation. If the numbers belong 
to a population, in symbols a deviation is $x - \mu$. For sample data, in symbols a deviation is $x - \bar{x}$.


The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are 
data from a sample. The calculations are similar, but not identical. Therefore the symbol used to represent the 
standard deviation depends on whether it is calculated from a population or a sample. The lower case letter s 
represents the sample standard deviation and the Greek letter $\sigma$ (sigma, lower case) represents the population
standard deviation. If the sample has the same characteristics as the population, then s should be a good estimate
of $\sigma$.


To calculate the standard deviation, we need to calculate the variance first. The variance is an average of the 
squares of the deviations (the $x - \bar{x}$ values for a sample, or the $x - \mu$ values for a population). The symbol $\sigma^2$
represents the population variance; the population standard deviation $\sigma$ is the square root of the population
variance. The symbol $s^2$ represents the sample variance; the sample standard deviation s is the square root of the
sample variance. You can think of the standard deviation as a special average of the deviations. 


If the numbers come from a census of the entire population and not a sample, when we calculate the average of 
the squared deviations to find the variance, we divide by N, the number of items in the population. If the data are 
from a sample rather than a population, when we calculate the average of the squared deviations, we divide by
n − 1, one less than the number of items in the sample. You can see that in the formulas below.


Formulas for the Sample Standard Deviation

$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\dfrac{\sum f \cdot (x - \bar{x})^2}{n - 1}}$

• For the sample standard deviation, the denominator is n − 1, that is, the sample size MINUS 1.

Formulas for the Population Standard Deviation

$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\dfrac{\sum f \cdot (x - \mu)^2}{N}}$

• For the population standard deviation, the denominator is N, the number of items in the population.


In these formulas, f represents the frequency with which a value appears. For example, if a value appears once, f 
is 1. If a value appears three times in the data set or population, f is 3. 


Sampling Variability of a Statistic 

The statistic of a sampling distribution was discussed in Descriptive Statistics: Measuring the Center of the 
Data. How much the statistic varies from one sample to another is known as the sampling variability of a 
statistic. You typically measure the sampling variability of a statistic by its standard error. The standard error of 
the mean is an example of a standard error. It is a special standard deviation and is known as the standard 
deviation of the sampling distribution of the mean. You will cover the standard error of the mean in The Central
Limit Theorem (not now). The notation for the standard error of the mean is $\sigma / \sqrt{n}$, where $\sigma$ is the standard
deviation of the population and n is the size of the sample.


Note: In practice, USE A CALCULATOR OR COMPUTER SOFTWARE TO CALCULATE THE 
STANDARD DEVIATION. If you are using a TI-83,83+,84+ calculator, you need to select the appropriate
standard deviation $\sigma_x$ or $s_x$ from the summary statistics. We will concentrate on using and interpreting the
information that the standard deviation gives us. However you should study the following step-by-step example to 
help you understand how the standard deviation measures variation from the mean. 


Example: 

In a fifth grade class, the teacher was interested in the average age and the sample standard deviation of the ages 
of her students. The following data are the ages for a SAMPLE of n = 20 fifth grade students. The ages are
rounded to the nearest half year: 

9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5

Equation:

$\bar{x} = \dfrac{9 + 9.5 \times 2 + 10 \times 4 + 10.5 \times 4 + 11 \times 6 + 11.5 \times 3}{20} = 10.525$


The average age is 10.53 years, rounded to 2 places. 
The variance may be calculated by using a table. Then the standard deviation is calculated by taking the square 
root of the variance. We will explain the parts of the table after calculating s. 


Data   Freq.   Deviations                Deviations²              (Freq.)(Deviations²)
$x$    $f$     $(x - \bar{x})$           $(x - \bar{x})^2$        $(f)(x - \bar{x})^2$
9      1       9 − 10.525 = −1.525       (−1.525)² = 2.325625     1 × 2.325625 = 2.325625
9.5    2       9.5 − 10.525 = −1.025     (−1.025)² = 1.050625     2 × 1.050625 = 2.101250
10     4       10 − 10.525 = −0.525      (−0.525)² = 0.275625     4 × 0.275625 = 1.1025
10.5   4       10.5 − 10.525 = −0.025    (−0.025)² = 0.000625     4 × 0.000625 = 0.0025
11     6       11 − 10.525 = 0.475       (0.475)² = 0.225625      6 × 0.225625 = 1.35375
11.5   3       11.5 − 10.525 = 0.975     (0.975)² = 0.950625      3 × 0.950625 = 2.851875


The sample variance, $s^2$, is equal to the sum of the last column (9.7375) divided by the total number of data
values minus one (20 − 1):

$s^2 = \dfrac{9.7375}{20 - 1} = 0.5125$

The sample standard deviation s is equal to the square root of the sample variance:

$s = \sqrt{0.5125} = 0.715891$. Rounded to two decimal places, s = 0.72.

Typically, you do the calculation for the standard deviation on your calculator or computer. The 
intermediate results are not rounded. This is done for accuracy. 

Exercise: 


Problem: Verify the mean and standard deviation calculated above on your calculator or computer. 


Solution: 
Using the TI-83,83+,84+ Calculators 


• Enter data into the list editor. Press STAT 1:EDIT. If necessary, clear the lists by arrowing up into the
name. Press CLEAR and arrow down.
• Put the data values (9, 9.5, 10, 10.5, 11, 11.5) into list L1 and the frequencies (1, 2, 4, 4, 6, 3) into list
L2. Use the arrow keys to move around.
• Press STAT and arrow to CALC. Press 1:1-VarStats and enter L1 (2nd 1), L2 (2nd 2). Do not forget the
comma. Press ENTER.
• $\bar{x} = 10.525$
• Use Sx because this is sample data (not a population): Sx = 0.715891
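
For readers who verify on a computer instead of a calculator, the following is one possible Python sketch. It is not part of the original solution. It expands the frequency table into the 20 individual ages, applies the sample formula directly, and then compares the result with the standard library's `statistics.stdev`, which also divides by n − 1.

```python
import math
import statistics

values = [9, 9.5, 10, 10.5, 11, 11.5]
freqs = [1, 2, 4, 4, 6, 3]

# Expand the frequency table into the 20 individual ages.
ages = [x for x, f in zip(values, freqs) for _ in range(f)]

n = len(ages)
mean = sum(ages) / n
sample_variance = sum((x - mean) ** 2 for x in ages) / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(n, round(mean, 3))           # 20 10.525
print(round(sample_variance, 4))   # 0.5125
print(round(sample_sd, 6))         # 0.715891

# The standard library's sample standard deviation agrees.
print(round(statistics.stdev(ages), 6))
```

To treat the same numbers as a whole population, `statistics.pstdev` divides by N instead.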


For the following problems, recall that value = mean + (#ofSTDEVs)(standard deviation) 
For a sample: $x = \bar{x} + (\text{\#ofSTDEVs})(s)$

For a population: $x = \mu + (\text{\#ofSTDEVs})(\sigma)$

For this example, use $x = \bar{x} + (\text{\#ofSTDEVs})(s)$ because the data is from a sample.


Exercise: 
Problem: Find the value that is 1 standard deviation above the mean. Find $(\bar{x} + 1s)$.


Solution: 


$(\bar{x} + 1s) = 10.53 + (1)(0.72) = 11.25$


Exercise: 
Problem: Find the value that is two standard deviations below the mean. Find $(\bar{x} - 2s)$.


Solution: 


$(\bar{x} - 2s) = 10.53 - (2)(0.72) = 9.09$


Exercise: 


Problem: Find the values that are 1.5 standard deviations from (below and above) the mean. 


Solution: 


• $(\bar{x} - 1.5s) = 10.53 - (1.5)(0.72) = 9.45$
• $(\bar{x} + 1.5s) = 10.53 + (1.5)(0.72) = 11.61$


Explanation of the standard deviation calculation shown in the table 

The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the mean than 
is the data value 11. The deviations 0.975 and 0.475 indicate that. A positive deviation occurs when the data value is
greater than the mean. A negative deviation occurs when the data value is less than the mean; the deviation is 
-1.525 for the data value 9. If you add the deviations, the sum is always zero. (For this example, there are n=20 
deviations.) So you cannot simply add the deviations to get the spread of the data. By squaring the deviations, you 
make them positive numbers, and the sum will also be positive. The variance, then, is the average squared 
deviation. 
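
The claim that the deviations always sum to zero can be checked with exact arithmetic. The sketch below is an illustration, not part of the original text. It uses Python's `fractions.Fraction` on the age data so that no rounding error enters, and it weights each deviation by its frequency because every one of the 20 data values contributes one deviation.

```python
from fractions import Fraction

# Ages and frequencies from the fifth grade example.
ages = [Fraction("9"), Fraction("9.5"), Fraction("10"),
        Fraction("10.5"), Fraction("11"), Fraction("11.5")]
freqs = [1, 2, 4, 4, 6, 3]

n = sum(freqs)
mean = sum(f * x for x, f in zip(ages, freqs)) / n

# Signed deviations cancel; squared deviations cannot.
sum_of_deviations = sum(f * (x - mean) for x, f in zip(ages, freqs))
sum_of_squares = sum(f * (x - mean) ** 2 for x, f in zip(ages, freqs))

print(float(mean))            # 10.525
print(sum_of_deviations)      # 0
print(float(sum_of_squares))  # 9.7375
```

The sum of squared deviations, 9.7375, is the total of the last column of the table.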


The variance is a squared measure and does not have the same units as the data. Taking the square root solves the 
problem. The standard deviation measures the spread in the same units as the data. 


Notice that instead of dividing by n=20, the calculation divided by n-1=20-1=19 because the data is a sample. For 
the sample variance, we divide by the sample size minus one (n-1). Why not divide by n? The answer has to do 
with the population variance. The sample variance is an estimate of the population variance. Based on the 
theoretical mathematics that lies behind these calculations, dividing by (n-1) gives a better estimate of the 
population variance. 
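
One way to see why dividing by n − 1 gives a better estimate is to enumerate every possible sample from a tiny population. The sketch below is an illustration under stated assumptions: the population [1, 2, 3, 4] is hypothetical, and samples of size 2 are drawn with replacement. It averages the two candidate variance formulas over all 16 samples and compares each average with the population variance.

```python
import itertools
import statistics

population = [1, 2, 3, 4]  # hypothetical toy population
sigma_squared = statistics.pvariance(population)  # divides by N

# Every possible ordered sample of size 2, drawn with replacement.
samples = list(itertools.product(population, repeat=2))


def variance_dividing_by_n(sample):
    """Average squared deviation from the sample mean, dividing by n."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)


avg_dividing_by_n = statistics.mean(
    [variance_dividing_by_n(s) for s in samples])
avg_dividing_by_n_minus_1 = statistics.mean(
    [statistics.variance(s) for s in samples])  # divides by n - 1

print(sigma_squared)              # 1.25
print(avg_dividing_by_n)          # 0.625
print(avg_dividing_by_n_minus_1)  # 1.25
```

In this toy case, dividing by n underestimates the population variance on average, while dividing by n − 1 matches it.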


Note: Your concentration should be on what the standard deviation tells us about the data. The standard deviation 
is a number which measures how far the data are spread from the mean. Let a calculator or computer do the 
arithmetic. 


The standard deviation, s or $\sigma$, is either zero or larger than zero. When the standard deviation is 0, there is no
spread; that is, all the data values are equal to each other. The standard deviation is small when the data are all
concentrated close to the mean, and is larger when the data values show more variation from the mean. When the
standard deviation is a lot larger than zero, the data values are very spread out about the mean; outliers can make s
or $\sigma$ very large.


The standard deviation, when first presented, can seem unclear. By graphing your data, you can get a better "feel" 
for the deviations and the standard deviation. You will find that in symmetrical distributions, the standard deviation 
can be very helpful but in skewed distributions, the standard deviation may not be much help. The reason is that 
the two sides of a skewed distribution have different spreads. In a skewed distribution, it is better to look at the 
first quartile, the median, the third quartile, the smallest value, and the largest value. Because numbers can be 
confusing, always graph your data. 


Note: The formula for the standard deviation is at the end of the chapter.


Example: 
Exercise: 


Problem: Use the following data (first exam scores) from Susan Dean's spring pre-calculus class: 


33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 94; 96; 100


• a. Create a chart containing the data, frequencies, relative frequencies, and cumulative relative
frequencies to three decimal places.
• b. Calculate the following to one decimal place using a TI-83+ or TI-84 calculator:
    i. The sample mean
    ii. The sample standard deviation
    iii. The median
    iv. The first quartile
    v. The third quartile
    vi. IQR
• c. Construct a box plot and a histogram on the same set of axes. Make comments about the box plot, the
histogram, and the chart.


Solution: 
• a.

Data   Frequency   Relative Frequency   Cumulative Relative Frequency
33     1           0.032                0.032
42     1           0.032                0.064
49     2           0.065                0.129
53     1           0.032                0.161
55     2           0.065                0.226
61     1           0.032                0.258
63     1           0.032                0.290
67     1           0.032                0.322
68     2           0.065                0.387
69     2           0.065                0.452
72     1           0.032                0.484
73     1           0.032                0.516
74     1           0.032                0.548
78     1           0.032                0.580
80     1           0.032                0.612
83     1           0.032                0.644
88     3           0.097                0.741
90     1           0.032                0.773
92     1           0.032                0.805
94     4           0.129                0.934
96     1           0.032                0.966
100    1           0.032                0.998 (Why isn't this value 1?)
• b.
    i. The sample mean = 73.5
    ii. The sample standard deviation = 17.9
    iii. The median = 73
    iv. The first quartile = 61
    v. The third quartile = 90
    vi. IQR = 90 − 61 = 29


• c. The x-axis goes from 32.5 to 100.5; y-axis goes from -2.4 to 15 for the histogram; number of intervals
is 5 for the histogram so the width of an interval is (100.5 - 32.5) divided by 5 which is equal to 13.6.
Endpoints of the intervals: starting point is 32.5, 32.5+13.6 = 46.1, 46.1+13.6 = 59.7, 59.7+13.6 = 73.3,
73.3+13.6 = 86.9, 86.9+13.6 = 100.5 = the ending value; no data values fall on an interval boundary.


The long left whisker in the box plot is reflected in the left side of the histogram. The spread of the exam scores in 
the lower 50% is greater (73 - 33 = 40) than the spread in the upper 50% (100 - 73 = 27). The histogram, box plot, 
and chart all reflect this. There are a substantial number of A and B grades (80s, 90s, and 100). The histogram 
clearly shows this. The box plot shows us that the middle 50% of the exam scores (IQR = 29) are Ds, Cs, and Bs. 
The box plot also shows us that the lower 25% of the exam scores are Ds and Fs. 
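
The chart and the summary statistics for the exam scores can also be reproduced on a computer. The following Python sketch is an illustration, not the calculator procedure the exercise asks for. It assumes the default "exclusive" method of `statistics.quantiles`, which agrees with the quartiles reported for this data set; other software may use a different quartile convention. Relative frequencies are rounded to three places before being accumulated, which is why the last cumulative value is 0.998 and not 1.

```python
import statistics
from collections import Counter

scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

n = len(scores)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
q1, median, q3 = statistics.quantiles(scores, n=4)

print(n)                             # 31
print(round(mean, 1), round(sd, 1))  # 73.5 17.9
print(q1, median, q3, q3 - q1)       # 61.0 73.0 90.0 29.0

# Frequency table: value, frequency, relative and cumulative relative frequency.
cumulative = 0.0
for value, freq in sorted(Counter(scores).items()):
    rel = round(freq / n, 3)
    cumulative = round(cumulative + rel, 3)
    print(value, freq, rel, cumulative)
```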


Comparing Values from Different Data Sets 
The standard deviation is useful when comparing data values that come from different data sets. If the data sets 
have different means and standard deviations, it can be misleading to compare the data values directly. 


• For each data value, calculate how many standard deviations the value is away from its mean.
• Use the formula: value = mean + (#ofSTDEVs)(standard deviation); solve for #ofSTDEVs.
• $\text{\#ofSTDEVs} = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}}$
• Compare the results of this calculation.


#ofSTDEVs is often called a "z-score"; we can use the symbol z. In symbols, the formulas become: 


Sample:       $x = \bar{x} + zs$       $z = \dfrac{x - \bar{x}}{s}$
Population:   $x = \mu + z\sigma$      $z = \dfrac{x - \mu}{\sigma}$
Example: 
Exercise: 
Problem: 


Two students, John and Ali, from different high schools, wanted to find out who had the highest G.P.A. when 
compared to his school. Which student had the highest G.P.A. when compared to his school? 


Student   GPA    School Mean GPA   School Standard Deviation
John      2.85   3.0               0.7
Ali       77     80                10

Solution: 


For each student, determine how many standard deviations (#ofSTDEVs) his GPA is away from the average, 
for his school. Pay careful attention to signs when comparing and interpreting the answer. 


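$\text{\#ofSTDEVs} = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}}$, that is, $z = \dfrac{x - \mu}{\sigma}$

For John, $z = \text{\#ofSTDEVs} = \dfrac{2.85 - 3.0}{0.7} = -0.21$
For Ali, $z = \text{\#ofSTDEVs} = \dfrac{77 - 80}{10} = -0.3$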


John has the better G.P.A. when compared to his school because his G.P.A. is 0.21 standard deviations below 
his school's mean while Ali's G.P.A. is 0.3 standard deviations below his school's mean. 


John's z-score of —0.21 is higher than Ali's z-score of —0.3 . For GPA, higher values are better, so we 
conclude that John has the better GPA when compared to his school. 
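
The comparison can be sketched in a few lines of Python. This is an illustration only; the function name `z_score` and the dictionary layout are assumptions, not part of the text. Each student's GPA is converted to a z-score relative to his own school, and the larger z-score identifies the better relative GPA.

```python
def z_score(value, mean, stdev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / stdev


# (GPA, school mean GPA, school standard deviation) from the table.
students = {
    "John": (2.85, 3.0, 0.7),
    "Ali": (77, 80, 10),
}

z = {name: z_score(*stats) for name, stats in students.items()}
for name, score in z.items():
    print(name, round(score, 2))  # John -0.21, then Ali -0.3

best = max(z, key=z.get)
print(best)  # John
```

Because both z-scores are negative, the one closer to zero is the higher of the two.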


The following lists give a few facts that provide a little more insight into what the standard deviation tells us about 
the distribution of the data. 
For ANY data set, no matter what the distribution of the data is: 


• At least 75% of the data is within 2 standard deviations of the mean.
• At least 89% of the data is within 3 standard deviations of the mean.
• At least 95% of the data is within 4 1/2 standard deviations of the mean.
• This is known as Chebyshev's Rule.


For data having a distribution that is MOUND-SHAPED and SYMMETRIC: 


• Approximately 68% of the data is within 1 standard deviation of the mean.
• Approximately 95% of the data is within 2 standard deviations of the mean.
• More than 99% of the data is within 3 standard deviations of the mean.
• This is known as the Empirical Rule.
• It is important to note that this rule only applies when the shape of the distribution of the data is
mound-shaped and symmetric. We will learn more about this when studying the "Normal" or "Gaussian" probability
distribution in later chapters.


**With contributions from Roberta Bloom 


Glossary 


Standard Deviation 
A number that is equal to the square root of the variance and measures how far data values are from their 
mean. Notation: s for sample standard deviation and $\sigma$ for population standard deviation.


Variance 
Mean of the squared deviations from the mean. Square of the standard deviation. For a set of data, a deviation 
can be represented as $x - \bar{x}$ where x is a value of the data and $\bar{x}$ is the sample mean. The sample variance is
equal to the sum of the squares of the deviations divided by the difference of the sample size and 1. 


Chebyshev's Theorem

This module explains Chebyshev's Theorem as it pertains to the spread of
non-normal data. Given any data set, Chebyshev's Theorem gives a worst
case scenario for the percentage of data within a given number of standard
deviations from the mean.


Chebyshev's Theorem


The proportion (or fraction) of any data set lying within K standard
deviations of the mean is always at least $1 - \dfrac{1}{K^2}$, where K is any positive
number greater than 1. Why is this?


For K = 2, the proportion is $1 - \dfrac{1}{2^2} = 1 - \dfrac{1}{4} = \dfrac{3}{4}$, hence at least 3/4 or 75% of the
data falls within 2 standard deviations of the mean.
For K = 3, the proportion is $1 - \dfrac{1}{3^2} = 1 - \dfrac{1}{9} = \dfrac{8}{9}$, hence at least 8/9 or
approximately 89% of the data falls within 3 standard deviations of the
mean.


Example: 

Using the data from the pre-calculus class exams and K = 2, this means 
that at least 75% of the scores fall between 73.5 - 2(17.9) and 73.5 + 
2(17.9), or between 37.7 and 109.3. 

In actual fact, all but one data value falls in this range; however,
Chebyshev's Theorem gives the worst case scenario.
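
To compare the guaranteed bound with what actually happens in the exam data, the following Python sketch may help. It is an illustration, not part of the original module. The helper name `chebyshev_lower_bound` is hypothetical, and the mean and standard deviation are the rounded values 73.5 and 17.9 used in the text.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of any data set within k standard deviations."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]
mean, sd = 73.5, 17.9  # rounded values from the text

k = 2
low, high = mean - k * sd, mean + k * sd
inside = sum(low <= x <= high for x in scores)

print(chebyshev_lower_bound(2))            # 0.75
print(round(chebyshev_lower_bound(3), 3))  # 0.889
print(round(low, 1), round(high, 1))       # 37.7 109.3
print(inside, len(scores))                 # 30 31
```

Thirty of the 31 scores (about 97%) fall in the interval, comfortably above the guaranteed 75%.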


Exercise: 


Problem: 


Using the pre-calculus class exams, what would the range of values be 
for at least 89% of the data according to Chebyshev's Theorem?


Solution: 


73.5 — 3(17.9) to 73.5 + 3(17.9) or 19.8 to 127.2 


Example: 

Using Chebyshev's Theorem, what percent of the data would fall between
46.65 and 100.35? 

Step 1: Find how far the maximum (or minimum) value is from the mean. 
100.35 — 73.5 = 26.85 

Step 2: How many standard deviations does 26.85 represent? 26.85/17.9 = 
1.5. Hence K = 1.5 


Step 3: If K = 1.5, then the percentage is $1 - \dfrac{1}{1.5^2} \approx 0.55556$, or
approximately 56%
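
The three steps can be wrapped into one small function. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and it requires the interval to be symmetric about the mean, because that is the case the theorem addresses directly.

```python
import math


def chebyshev_bound_for_interval(low, high, mean, stdev):
    """Return (K, lower bound) for an interval symmetric about the mean."""
    distance = high - mean  # Step 1: distance from the mean
    if not math.isclose(mean - low, distance):
        raise ValueError("interval must be symmetric about the mean")
    k = distance / stdev    # Step 2: number of standard deviations
    if k <= 1:
        raise ValueError("the bound is only informative for K > 1")
    return k, 1 - 1 / k ** 2  # Step 3: Chebyshev's lower bound


k, bound = chebyshev_bound_for_interval(46.65, 100.35, 73.5, 17.9)
print(round(k, 2), round(bound, 2))  # 1.5 0.56
```

The returned proportion is a minimum: at least that share of the data lies in the interval.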

Given a data set with a mean of 56.3 and a standard deviation of 8.2, use 
this information and Chebyshev's Theorem to answer the following
questions. 


Exercise: 


Problem: 


What percent of the data lies within 2.2 standard deviations from the
mean?


Solution: 


$1 - \dfrac{1}{2.2^2} \approx 0.793$ or 79%


Exercise: 


Problem: 


For the given set of data, about 79% of the data falls between which
two values? 


Solution: 


56.3 — 2.2(8.2) = 38.26 and 56.3 + 2.2(8.2) = 74.34 
Exercise: 


Problem: 
What percent of the data lies between the values 45.64 and 66.96? 
Solution: 


Step 1: Find how far the maximum value is from the mean: 66.96 — 
56.3 = 10.66 


Step 2: How many standard deviations does 10.66 represent? 10.66/8.2 
= 1.3. Hence K = 1.3 


Step 3: If K = 1.3, then the percentage is $1 - \dfrac{1}{1.3^2} \approx 0.408$ or
approximately 41%


Summary of Formulas 


A summary of useful formulas used in examining descriptive statistics 
Commonly Used Symbols 


• The symbol $\sum$ means to add or to find the sum.
• n = the number of data values in a sample
• N = the number of people, things, etc. in the population
• $\bar{x}$ = the sample mean
• s = the sample standard deviation
• $\mu$ = the population mean
• $\sigma$ = the population standard deviation
• f = frequency
• x = numerical value


Commonly Used Expressions 


• $x * f$ = A value multiplied by its respective frequency
• $\sum x$ = The sum of the values
• $\sum x * f$ = The sum of values multiplied by their respective frequencies
• $(x - \bar{x})$ or $(x - \mu)$ = Deviations from the mean (how far a value is
from the mean)
• $(x - \bar{x})^2$ or $(x - \mu)^2$ = Deviations squared
• $f(x - \bar{x})^2$ or $f(x - \mu)^2$ = The deviations squared and multiplied by
their frequencies


Mean Formulas: 


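• $\bar{x} = \dfrac{\sum x}{n}$ or $\bar{x} = \dfrac{\sum f \cdot x}{n}$
• $\mu = \dfrac{\sum x}{N}$ or $\mu = \dfrac{\sum f \cdot x}{N}$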


Standard Deviation Formulas: 


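• $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\dfrac{\sum f \cdot (x - \bar{x})^2}{n - 1}}$
• $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\dfrac{\sum f \cdot (x - \mu)^2}{N}}$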


Formulas Relating a Value, the Mean, and the Standard Deviation: 


e value = mean + (#ofSTDEVs)(standard deviation), where #ofSTDEVs 
= the number of standard deviations 

• $x = \bar{x} + (\text{\#ofSTDEVs})(s)$
• $x = \mu + (\text{\#ofSTDEVs})(\sigma)$


Practice 1: Center of the Data 

This module provides students with opportunities to apply concepts related 
to descriptive statistics. Students are asked to take a set of sample data and 
calculate a series of statistical values for that data. 


Student Learning Outcomes 


e The student will calculate and interpret the center, spread, and location 
of the data. 
• The student will construct and interpret histograms and box plots.


Given 


Sixty-five randomly selected car salespersons were asked the number of 
cars they generally sell in one week. Fourteen people answered that they 
generally sell three cars; nineteen generally sell four cars; twelve generally 
sell five cars; nine generally sell six cars; eleven generally sell seven cars. 


Complete the Table 


Data Value (# cars)   Frequency   Relative Frequency   Cumulative Relative Frequency

Discussion Questions 


Exercise: 


Problem: What does the frequency column sum to? Why? 


Solution: 
65 


Exercise: 


Problem: What does the relative frequency column sum to? Why? 


Solution: 


1 
Exercise: 


Problem: 


What is the difference between relative frequency and frequency for 
each data value? 


Exercise: 


Problem: 


What is the difference between cumulative relative frequency and 
relative frequency for each data value? 


Enter the Data 


Enter your data into your calculator or computer. 


Construct a Histogram 


Determine appropriate minimum and maximum x and y values and the 
scaling. Sketch the histogram below. Label the horizontal and vertical axes 
with words. Include numerical scaling. 


Data Statistics 


Calculate the following values: 
Exercise: 


Problem: Sample mean = $\bar{x}$ =


Solution: 


4.75 


Exercise: 


Problem: Sample standard deviation = $s_x$ =


Solution: 


1.39 


Exercise: 


Problem: Sample size = n = 


Solution: 


65 


Calculations 


Use the table in section 2.11.3 to calculate the following values: 
Exercise: 


Problem: Median = 


Solution: 
4 
Exercise: 


Problem: Mode = 


Solution: 
4 
Exercise: 


Problem: First quartile = 


Solution: 
4 
Exercise: 


Problem: Second quartile = median = 50th percentile = 


Solution: 


4 


Exercise: 


Problem: Third quartile = 


Solution: 
6 


Exercise: 


Problem: Interquartile range (IQR) =
Solution: 
6—4=2 

Exercise: 


Problem: 10th percentile = 


Solution: 
3 
Exercise: 


Problem: 70th percentile = 

Solution: 

6 

Exercise: 

Problem: Find the value that is 3 standard deviations: 
• a. Above the mean
• b. Below the mean

Solution:

• a. 8.93
• b. 0.58
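
The answers in this practice can be checked from the frequency counts given at the start. The following Python sketch is an illustration, not part of the original practice. It applies the frequency form of the sample standard deviation formula and keeps full precision until the final rounding.

```python
import math

# Number of cars sold per week -> number of salespersons giving that answer.
cars = {3: 14, 4: 19, 5: 12, 6: 9, 7: 11}

n = sum(cars.values())
mean = sum(x * f for x, f in cars.items()) / n
s = math.sqrt(sum(f * (x - mean) ** 2 for x, f in cars.items()) / (n - 1))

print(n, round(mean, 2), round(s, 2))              # 65 4.75 1.39
print(round(mean + 3 * s, 2), round(mean - 3 * s, 2))  # 8.93 0.58
```

Using the already rounded values 4.75 and 1.39 would give 8.92 for the upper value, which is why intermediate results are left unrounded.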


Box Plot 


Construct a box plot below. Use a ruler to measure and scale accurately. 


Interpretation 


Looking at your box plot, does it appear that the data are concentrated 
together, spread out evenly, or concentrated in some areas, but not in 
others? How can you tell? 


Practice 2: Spread of the Data 
Practice exercise for Descriptive Statistics 


Student Learning Outcomes 


e The student will calculate measures of the center of the data. 
e The student will calculate the spread of the data. 


Given 


The population parameters below describe the full-time equivalent number of 
students (FTES) each year at Lake Tahoe Community College from 1976-77 
through 2004-2005. (Source: Graphically Speaking by Bill King, LTCC 
Institutional Research, December 2005). 


Use these values to answer the following questions: 


• $\mu$ = 1000 FTES
• Median = 1014 FTES
• $\sigma$ = 474 FTES
• First quartile = 528.5 FTES
• Third quartile = 1447.5 FTES
• n = 29 years


Calculate the Values 


Exercise: 


Problem: 


A sample of 11 years is taken. About how many are expected to have a 
FTES of 1014 or above? Explain how you determined your answer. 


Solution: 


6 


Exercise: 


Problem: 75% of all years have a FTES: 
• a. At or below:
• b. At or above:

Solution:

• a. 1447.5
• b. 528.5


Exercise: 


Problem: The population standard deviation = 


Solution: 


474 FTES 
Exercise: 


Problem: 


What percent of the FTES were from 528.5 to 1447.5? How do you 
know? 


Solution: 
50% 
Exercise: 
Problem: What is the IQR? What does the IQR represent? 
Solution: 


919


Exercise: 


Problem: 
How many standard deviations away from the mean is the median? 
Solution: 


0.03 


Additional Information: The population FTES for 2005-2006 through 2010-2011
was given in an updated report. (Source:
http://www.ltcc.edu/data/ResourcePDF/LTCC_FactBook_2010-11.pdf). The
data are reported here.


Year         2005-06   2006-07   2007-08   2008-09   2009-10   2010-11
Total FTES   1585      1690      1735      1935      2021      1890

Exercise: 
Problem: 


Calculate the mean, median, standard deviation, first quartile, the third 
quartile and the IQR. Round to one decimal place. 


Solution: 


mean = 1809.3 

median = 1812.5 

standard deviation = 151.2 
First quartile = 1690 


Third quartile = 1935 
IQR = 245
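
These values can be reproduced with a short Python sketch, offered here as an illustration only. Two assumptions are worth stating. First, the reported standard deviation matches the population formula (`statistics.pstdev`), which fits the description of these figures as population FTES. Second, the quartiles are computed as the medians of the lower and upper halves of the ordered data; the helper `quartiles_by_halves` is hypothetical and written for this purpose.

```python
import statistics

ftes = [1585, 1690, 1735, 1935, 2021, 1890]


def quartiles_by_halves(data):
    """Return (Q1, median, Q3) using medians of the lower and upper halves."""
    ordered = sorted(data)
    half = len(ordered) // 2  # for odd n the middle value is in neither half
    lower, upper = ordered[:half], ordered[-half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


q1, median, q3 = quartiles_by_halves(ftes)

print(round(statistics.mean(ftes), 1))    # 1809.3
print(median)                             # 1812.5
print(round(statistics.pstdev(ftes), 1))  # 151.2 (population formula)
print(q1, q3, q3 - q1)                    # 1690 1935 245
```

Treating the six years as a sample instead (`statistics.stdev`) gives about 165.6, so the choice of formula matters for a data set this small.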


Exercise: 
Problem: 
Construct a boxplot for the FTES for 2005-2006 through 2010-2011 and 
a boxplot for the FTES for 1976-1977 through 2004-2005. 
Exercise: 
Problem: 
Compare the IQR for the FTES for 1976-77 through 2004-2005 with the
IQR for the FTES for 2005-2006 through 2010-2011. Why do you
suppose the IQRs are so different?


Solution: 


Hint: Think about the number of years covered by each time period and 
what happened to higher education during those periods. 


Homework 

Descriptive Statistics: Homework is part of the collection col10555 written 
by Barbara Illowsky and Susan Dean and provides homework questions 
related to lessons about descriptive statistics. 

Exercise: 


Problem: 


Twenty-five randomly selected students were asked the number of 
movies they watched the previous week. The results are as follows: 


# of movies   Frequency   Relative Frequency   Cumulative Relative Frequency
0             5
1             9
2             6
3             4
4             1


• a. Find the sample mean $\bar{x}$
• b. Find the sample standard deviation, s
• c. Construct a histogram of the data.
• d. Complete the columns of the chart.
• e. Find the first quartile.
• f. Find the median.
• g. Find the third quartile.
• h. Construct a box plot of the data.
• i. What percent of the students saw fewer than three movies?
• j. Find the 40th percentile.
• k. Find the 90th percentile.
• l. Construct a line graph of the data.
• m. Construct a stem plot of the data.


Solution: 


• a. 1.48
• b. 1.12
• e. 1
• f. 1
• i. 80%
• j. 1
• k. 3


Exercise: 


Problem: 


The median age for U.S. blacks currently is 30.9 years; for U.S. whites 
it is 42.3 years. (Source:
http://www.usatoday.com/news/nation/story/2012-05-17/minority-births-census/55029100/1)


• a. Based upon this information, give two reasons why the black
median age could be lower than the white median age.
• b. Does the lower median age for blacks necessarily mean that
blacks die younger than whites? Why or why not?
• c. How might it be possible for blacks and whites to die at
approximately the same age, but for the median age for whites to
be higher?


Exercise: 
Problem: 
Forty randomly selected students were asked the number of pairs of
sneakers they owned. Let X = the number of pairs of sneakers owned.
The results are as follows:


X   Frequency   Relative Frequency   Cumulative Relative Frequency
1   2
2   5
3   8
4   12
5   12
7   1


• a. Find the sample mean $\bar{x}$
• b. Find the sample standard deviation, s
• c. Construct a histogram of the data.
• d. Complete the columns of the chart.
• e. Find the first quartile.
• f. Find the median.
• g. Find the third quartile.
• h. Construct a box plot of the data.
• i. What percent of the students owned at least five pairs?
• j. Find the 40th percentile.
• k. Find the 90th percentile.
• l. Construct a line graph of the data
• m. Construct a stem plot of the data


Solution: 


• a. 3.78
• b. 1.29
• e. 3
• f. 4
• g. 5
• i. 32.5%
• j. 4
• k. 5
Exercise: 
Problem: 
600 adult Americans were asked by telephone poll, "What do you think
constitutes a middle-class income?" The results are below. Also, include
the left endpoint, but not the right endpoint. (Source: Time magazine;
survey by Yankelovich Partners, Inc.)


Note: "Not sure" answers were omitted from the results.


Salary ($) Relative Frequency 
< 20,000 0.02 
20,000 - 25,000 0.09 
25,000 - 30,000 0.19 
30,000 - 40,000 0.26 
40,000 - 50,000 0.18 
50,000 - 75,000 0.17 
75,000 - 99,999 0.02 
100,000+ 0.01 


• a. What percent of the survey answered "not sure"?
• b. What percent think that middle-class is from $25,000 - $50,000?
• c. Construct a histogram of the data
    i. Should all bars have the same width, based on the data?
    Why or why not?
    ii. How should the <20,000 and the 100,000+ intervals be
    handled? Why?
• d. Find the 40th and 80th percentiles
• e. Construct a bar graph of the data


Exercise: 


Problem: 


Following are the published weights (in pounds) of all of the team 
members of the San Francisco 49ers from a previous year (Source: San 
Jose Mercury News) 


177 205 210 210 232 205 185 185 178 210 206 212 184 174 185 242 
188 212 215 247 241 223 220 260 245 259 278 270 280 295 275 285 
290 272 273 280 285 286 200 215 185 230 250 241 190 260 250 302 
265 290 276 228 265 


• a. Organize the data from smallest to largest value.
• b. Find the median.
• c. Find the first quartile.
• d. Find the third quartile.
• e. Construct a box plot of the data.
• f. The middle 50% of the weights are from _______ to _______.
• g. If our population were all professional football players, would
the above data be a sample of weights or the population of
weights? Why?
• h. If our population were the San Francisco 49ers, would the above
data be a sample of weights or the population of weights? Why?
• i. Assume the population was the San Francisco 49ers. Find:
    i. the population mean, $\mu$.
    ii. the population standard deviation, $\sigma$.
    iii. the weight that is 2 standard deviations below the mean.
    iv. When Steve Young, quarterback, played football, he
    weighed 205 pounds. How many standard deviations above
    or below the mean was he?
• j. That same year, the mean weight for the Dallas Cowboys was
240.08 pounds with a standard deviation of 44.38 pounds. Emmit
Smith weighed in at 209 pounds. With respect to his team, who
was lighter, Smith or Young? How did you determine your
answer?


Solution: 


• b. 241
• c. 205.5
• d. 272.5
• e. Box plot with five-number summary: 174, 205.5, 241, 272.5, 302
• f. 205.5, 272.5
• g. sample
• h. population
• i.
    i. 236.34
    ii. 37.50
    iii. 161.34
    iv. 0.84 std. dev. below the mean
• j. Young


Exercise: 


Problem: 


An elementary school class ran 1 mile with a mean of 11 minutes and a 
standard deviation of 3 minutes. Rachel, a student in the class, ran 1 
mile in 8 minutes. A junior high school class ran 1 mile with a mean of 
9 minutes and a standard deviation of 2 minutes. Kenji, a student in the 
class, ran 1 mile in 8.5 minutes. A high school class ran 1 mile with a 
mean of 7 minutes and a standard deviation of 4 minutes. Nedda, a 
student in the class, ran 1 mile in 8 minutes. 


• a. Why is Kenji considered a better runner than Nedda, even
though Nedda ran faster than he?
• b. Who is the fastest runner with respect to his or her class?
Explain why.


Exercise: 


Problem: 


In a survey of 20 year olds in China, Germany and America, people 
were asked the number of foreign countries they had visited in their 
lifetime. The following box plots display the results. 


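[Figure: box plots of the number of foreign countries visited by 20 year olds surveyed in China, Germany, and America, on a common axis from 0 to 11.]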


• a. In complete sentences, describe what the shape of each box plot
implies about the distribution of the data collected.
• b. Explain how it is possible that more Americans than Germans
surveyed have been to over eight foreign countries.
• c. Compare the three box plots. What do they imply about the
foreign travel of twenty year old residents of the three countries
when compared to each other?


Exercise: 


Problem: 


One hundred teachers attended a seminar on mathematical problem 
solving. The attitudes of a representative sample of 12 of the teachers 
were measured before and after the seminar. A positive number for 
change in attitude indicates that a teacher's attitude toward math 
became more positive. The twelve change scores are as follows: 


3, 8, -1, 2, 0, 5, -3, 1, -1, 6, 5, -2

a. What is the mean change score?

b. What is the standard deviation for this population?

c. What is the median change score?

d. Find the change score that is 2.2 standard deviations below the
mean.


Exercise: 


Problem: 


Three students were applying to the same graduate school. They came 
from schools with different grading systems. Which student had the 
best G.P.A. when compared to his school? Explain how you 
determined your answer. 


Student    G.P.A.    School Ave. G.P.A.    School Standard Deviation
Thuy       2.7       3.2                   0.8
Vichet     87        75                    20
Kamala     8.6       8                     0.4

Solution:

Kamala
Exercise: 


Problem: Given the following box plot: 


[Box plot with values 0, 2, 10, 12, 13 marked]

a. Which quarter has the smallest spread of data? What is that
spread?

b. Which quarter has the largest spread of data? What is that
spread?

c. Find the Inter Quartile Range (IQR).

d. Are there more data in the interval 5 - 10 or in the interval 10 -
13? How do you know this?

e. Which interval has the fewest data in it? How do you know this?

   I. 0-2
   II. 2-4
   III. 10-12
   IV. 12-13


Exercise: 


Problem: Given the following box plot: 


[Box plot with values 0, 20, 100, 150 marked]

a. Think of an example (in words) where the data might fit into the
above box plot. In 2-5 sentences, write down the example.

b. What does it mean to have the first and second quartiles so close
together, while the second to fourth quartiles are far apart?


Exercise: 
Problem: 


Santa Clara County, CA, has approximately 27,873 Japanese- 
Americans. Their ages are as follows. (Source: West magazine) 


Age Group    Percent of Community
0-17         18.9
18-24        8.0
25-34        22.8
35-44        15.0
45-54        13.1
55-64        11.9
65+          10.3

a. Construct a histogram of the Japanese-American community in
Santa Clara County, CA. The bars will not be the same width for
this example. Why not?

b. What percent of the community is under age 35?

c. Which box plot most resembles the information above?

   i. [Box plot: 0, 24, 34, 53, 100]
   ii. [Box plot: 0, 18, 34, 45, 100]
   iii. [Box plot: 0, 24, 25, 54, 100]
Exercise: 
Problem: 


Suppose that three book publishers were interested in the number of 
fiction paperbacks adult consumers purchase per month. Each 
publisher conducted a survey. In the survey, each asked adult 
consumers the number of fiction paperbacks they had purchased the 
previous month. The results are below. 


# of books    Freq.    Rel. Freq.
0             10
1             12
2             16
3             12
4             8
5             6
6             2
8             2
Publisher A

# of books    Freq.    Rel. Freq.
0             18
1             24
2             24
3             22
4             15
5             10
7             5
9             1
Publisher B

# of books    Freq.    Rel. Freq.
0-1           20
2-3           35
4-5           12
6-7           2
8-9           1
Publisher C

a. Find the relative frequencies for each survey. Write them in the
charts.

b. Using a graphing calculator, a computer, or pencil and paper, use the
frequency column to construct a histogram for each publisher's
survey. For Publishers A and B, make bar widths of 1. For
Publisher C, make bar widths of 2.

c. In complete sentences, give two reasons why the graphs for
Publishers A and B are not identical.

d. Would you have expected the graph for Publisher C to look like
the other two graphs? Why or why not?

e. Make new histograms for Publisher A and Publisher B. This
time, make bar widths of 2.

f. Now, compare the graph for Publisher C to the new graphs for
Publishers A and B. Are the graphs more similar or more
different? Explain your answer.


Exercise: 


Problem: 


Often, cruise ships conduct all on-board transactions, with the 
exception of gambling, on a cashless basis. At the end of the cruise, 
guests pay one bill that covers all on-board transactions. Suppose that 
60 single travelers and 70 couples were surveyed as to their on-board 
bills for a seven-day cruise from Los Angeles to the Mexican Riviera. 
Below is a summary of the bills for each group. 


Amount($)    Frequency    Rel. Frequency
51-100       5
101-150      10
151-200      15
201-250      15
251-300      10
301-350      5
Singles

Amount($)    Frequency    Rel. Frequency
100-150      5
201-250      5
251-300      5
301-350      5
351-400      10
401-450      10
451-500      10
501-550      10
551-600      5
601-650      5
Couples

a. Fill in the relative frequency for each group.

b. Construct a histogram for the Singles group. Scale the x-axis by
$50 widths. Use relative frequency on the y-axis.

c. Construct a histogram for the Couples group. Scale the x-axis by
$50. Use relative frequency on the y-axis.

d. Compare the two graphs:

   i. List two similarities between the graphs.
   ii. List two differences between the graphs.
   iii. Overall, are the graphs more similar or different?

e. Construct a new graph for the Couples by hand. Since each
couple is paying for two individuals, instead of scaling the x-axis
by $50, scale it by $100. Use relative frequency on the y-axis.

f. Compare the graph for the Singles with the new graph for the
Couples:

   i. List two similarities between the graphs.
   ii. Overall, are the graphs more similar or different?

i. By scaling the Couples graph differently, how did it change the
way you compared it to the Singles?

j. Based on the graphs, do you think that individuals spend the
same amount, more, or less as singles than they do person by person
in a couple? Explain why in one or two complete sentences.


Exercise: 
Problem: 
Refer to the following histograms and box plot. Determine which of
the following are true and which are false. Explain your solution to
each part in complete sentences.

[Figure: histograms and box plot, with axis values 1, 3, 6 marked]

a. The medians for all three graphs are the same.

b. We cannot determine if any of the means for the three graphs is
different.

c. The standard deviation for (b) is larger than the standard
deviation for (a).

d. We cannot determine if any of the third quartiles for the three
graphs is different.

Solution:

a. True
b. True
c. True
d. False


Exercise: 


Problem: Refer to the following box plots. 


Data 1: [box plot; marked values include 0 and 4]

Data 2: [box plot; marked values 0, 2, 7]

a. In complete sentences, explain why each statement is false.

   i. Data 1 has more data values above 2 than Data 2 has above 2.

   ii. The data sets cannot have the same mode.

   iii. For Data 1, there are more data values below 4 than there
   are above 4.

b. For which group, Data 1 or Data 2, is the value of “7” more
likely to be an outlier? Explain why in complete sentences.


Exercise: 


Problem: 


In a recent issue of the IEEE Spectrum, 84 engineering conferences 
were announced. Four conferences lasted two days. Thirty-six lasted 
three days. Eighteen lasted four days. Nineteen lasted five days. Four 
lasted six days. One lasted seven days. One lasted eight days. One 
lasted nine days. Let X = the length (in days) of an engineering 
conference. 


a. Organize the data in a chart.

b. Find the median, the first quartile, and the third quartile.

c. Find the 65th percentile.

d. Find the 10th percentile.

e. Construct a box plot of the data.

f. The middle 50% of the conferences last from _______ days to
_______ days.

g. Calculate the sample mean of days of engineering conferences.

h. Calculate the sample standard deviation of days of engineering
conferences.

i. Find the mode.

j. If you were planning an engineering conference, which would
you choose as the length of the conference: mean, median, or
mode? Explain why you made that choice.

k. Give two reasons why you think that 3 - 5 days seem to be
popular lengths of engineering conferences.

Solution:

b. 4, 3, 5
c. 4
d. 3
e. [Box plot]
f. 3, 5
g. 3.94
h. 1.28
i. 3
j. mode

Exercise: 


Problem: 


A survey of enrollment at 35 community colleges across the United 
States yielded the following figures (source: Microsoft Bookshelf): 


6414 1550 2109 9350 21828 4300 5944 5722 2825 2044 5481 5200 
9853 2750 10012 6357 27000 9414 7681 3200 17500 9200 7380 
18314 6557 13713 17768 7493 2771 2861 1263 7285 28165 5080 
11622 


a. Organize the data into a chart with five intervals of equal width.
Label the two columns "Enrollment" and "Frequency."

b. Construct a histogram of the data.

c. If you were to build a new community college, which piece of
information would be more valuable: the mode or the mean?

d. Calculate the sample mean.

e. Calculate the sample standard deviation.

f. A school with an enrollment of 8000 would be how many
standard deviations away from the mean?


Exercise: 


Problem: 


The median age of the U.S. population in 1980 was 30.0 years. In 
1991, the median age was 33.1 years. (Source: Bureau of the Census) 


a. What does it mean for the median age to rise?

b. Give two reasons why the median age could rise.

c. For the median age to rise, is the actual number of children less
in 1991 than it was in 1980? Why or why not?

Solution:

c. Maybe


Exercise: 


Problem: 


A survey was conducted of 130 purchasers of new BMW 3 series cars, 
130 purchasers of new BMW 5 series cars, and 130 purchasers of new 
BMW 7 series cars. In it, people were asked the age they were when 
they purchased their car. The following box plots display the results. 


[Figure: box plots of purchase age for the BMW 3 series, 5 series, and 7 series, on an axis from 25 to 80 marked in steps of 5.]

a. In complete sentences, describe what the shape of each box plot
implies about the distribution of the data collected for that car
series.

b. Which group is most likely to have an outlier? Explain how you
determined that.

c. Compare the three box plots. What do they imply about the age
of purchasing a BMW from the series when compared to each
other?

d. Look at the BMW 5 series. Which quarter has the smallest
spread of data? What is that spread?

e. Look at the BMW 5 series. Which quarter has the largest spread
of data? What is that spread?

f. Look at the BMW 5 series. Estimate the Inter Quartile Range
(IQR).

g. Look at the BMW 5 series. Are there more data in the interval
31-38 or in the interval 45-55? How do you know this?

h. Look at the BMW 5 series. Which interval has the fewest data in
it? How do you know this?

   i. 31-35
   ii. 38-41
   iii. 41-64


Exercise: 
Problem: 


The following box plot shows the U.S. population for 1990, the latest 
available year. (Source: Bureau of the Census, 1990 Census) 


[Box plot with values 0, 17, 33, 50, 105 marked]

a. Are there fewer or more children (age 17 and under) than senior
citizens (age 65 and over)? How do you know?

b. 12.6% are age 65 and over. Approximately what percent of the
population are working-age adults (above age 17 to age 65)?

Solution:

a. more children
b. 62.4%


Exercise: 


Problem: 


Javier and Ercilia are supervisors at a shopping mall. Each was given 
the task of estimating the mean distance that shoppers live from the 
mall. They each randomly surveyed 100 shoppers. The samples 
yielded the following information: 


        Javier       Ercilia
x̄       6.0 miles    6.0 miles
s       4.0 miles    7.0 miles

a. How can you determine which survey was correct?

b. Explain what the difference in the results of the surveys implies
about the data.

c. If the two histograms depict the distribution of values for each
supervisor, which one depicts Ercilia's sample? How do you
know?

d. If the two box plots depict the distribution of values for each
supervisor, which one depicts Ercilia’s sample? How do you
know?

   i. [Box plot: 0, 1, 6, 14, 21]
   ii. [Box plot: 0, 4, 6, 9, 12]
Exercise: 


Problem: Student grades on a chemistry exam were: 
77, 78, 76, 81, 86, 51, 79, 82, 84, 99 
a. Construct a stem-and-leaf plot of the data.

b. Are there any potential outliers? If so, which scores are they?
Why do you consider them outliers?

Solution:

b. 51, 99


Try these multiple choice questions (Exercises 24 - 30). 


The next three questions refer to the following information. We are 
interested in the number of years students in a particular elementary 
statistics class have lived in California. The information in the following 
table is from the entire section. 


Number of years    Frequency
7                  1
14                 3
15                 1
18                 1
19                 4
20                 3
22                 1
23                 1
26                 1
40                 2
42                 2
                   Total = 20
Exercise: 


Problem: What is the IQR? 


A. 8
B. 11
C. 15
D. 35


Solution: 


A 


Exercise: 


Problem: What is the mode? 


A. 19
B. 19.5
C. 14 and 20
D. 22.65


Solution: 
A 
Exercise: 
Problem: Is this a sample or the entire population? 


A. sample
B. entire population
C. neither


Solution: 
B 


The next two questions refer to the following table. X = the number of 
days per week that 100 clients use a particular exercise facility. 


x Frequency 


0 3 
1 12 
2 33 
3 28 
4 11 
5 9 
6 4 
Exercise: 


Problem: The 80th percentile is: 


A. 5
B. 80
C. 3
D. 4


Solution: 


D 
Exercise: 


Problem: 


The number that is 1.5 standard deviations BELOW the mean is 
approximately: 


A. 0.7
B. 4.8
C. -2.8
D. Cannot be determined


Solution: 


A 


The next two questions refer to the following histogram. Suppose one 
hundred eleven people who shopped in a special T-shirt store were asked 
the number of T-shirts they own costing more than $19 each. 


[Histogram: relative frequency on the vertical axis, number of T-shirts costing more than $19 each on the horizontal axis. Labeled values: 10/111, 17/111, 20/111, 23/111, 25/111, 30/111, 39/111, 40/111.]


Exercise: 


Problem: 


The percent of people that own at most three (3) T-shirts costing more 
than $19 each is approximately: 


A. 21
B. 59
C. 41
D. Cannot be determined


Solution: 


C
Exercise: 
Problem: 


If the data were collected by asking the first 111 people who entered 
the store, then the type of sampling is: 


A. cluster
B. simple random
C. stratified
D. convenience


Solution: 


D 
Exercise: 


Problem: 


Below are the 2010 obesity rates by U.S. states and Washington,
DC. (Source: http://www.cdc.gov/obesity/data/adult.html)

State             Percent (%)    State             Percent (%)
Alabama           32.2           Montana           23.0
Alaska            24.5           Nebraska          26.9
Arizona           24.3           Nevada            22.4
Arkansas          30.1           New Hampshire     25.0
California        24.0           New Jersey        23.8
Colorado          21.0           New Mexico        25.1
Connecticut       22.5           New York          23.9
Delaware          28.0           North Carolina    27.8
Washington, DC    22.2           North Dakota      27.2
Florida           26.6           Ohio              20.2
Georgia           29.6           Oklahoma          30.4
Hawaii            22.7           Oregon            26.8
Idaho             26.5           Pennsylvania      28.6
Illinois          28.2           Rhode Island      25.5
Indiana           29.6           South Carolina    31.5
Iowa              28.4           South Dakota      27.3
Kansas            29.4           Tennessee         30.8
Kentucky          31.3           Texas             31.0
Louisiana         31.0           Utah              22.5
Maine             26.8           Vermont           23.2
Maryland          27             Virginia          26.0
Massachusetts     23.0           Washington        25.5
Michigan          30.9           West Virginia     32.5
Minnesota         24.8           Wisconsin         26.3
Mississippi       34.0           Wyoming           25.1
Missouri          30.5


a. Construct a bar graph of obesity rates of your state and the four
states closest to your state. Hint: Label the x-axis with the states.

b. Use a random number generator to randomly pick 8 states.
Construct a bar graph of the obesity rates of those 8 states.

c. Construct a bar graph for all the states beginning with the letter
"A."

d. Construct a bar graph for all the states beginning with the letter

Solution: 


Example solution for b using the random number generator for the
TI-84 Plus to generate a simple random sample of 8 states. Instructions
are below.

• Number the entries in the table 1 - 51 (includes Washington, DC;
numbered vertically)

• Press MATH

• Arrow over to PRB

• Press 5:randInt(

• Enter 51,1,8)

Eight numbers are generated (use the right arrow key to scroll through
the numbers). The numbers correspond to the numbered states (for this
example: {47 21 9 23 51 13 25 4}). If any numbers are repeated,
generate a different number by using 5:randInt(51,1). Here, the states
(and Washington DC) are {Arkansas, Washington DC, Idaho,
Maryland, Michigan, Mississippi, Virginia, Wyoming}. Corresponding
percents are {28.7 21.8 24.5 26 28.9 32.8 25 24.6}.
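
For readers working in software rather than on the calculator, the same kind of simple random sample can be sketched in Python. This is only an illustration and not part of the original solution: the seed value is an arbitrary assumption made so that a run is repeatable, and random.sample draws without replacement, so repeated numbers never need to be redrawn.

```python
import random

# Number the 51 table entries 1-51 (50 states plus Washington, DC).
entry_numbers = range(1, 52)

# Arbitrary seed (an assumption for repeatability); remove it for a fresh sample.
rng = random.Random(2010)

# Draw 8 distinct entry numbers; sampling is without replacement.
chosen = sorted(rng.sample(entry_numbers, 8))
print(chosen)
```

Each number printed is then looked up in the vertically numbered table to find the corresponding state and its percent.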


[Bar graph: Percent (%) on the vertical axis, scaled from 0 to 40, for Arkansas, Washington DC, Idaho, Maryland, Michigan, Mississippi, Virginia, and Wyoming.]


Exercise: 


Problem: 


A music school has budgeted to purchase 3 musical instruments. They
plan to purchase a piano costing $3,000, a guitar costing $550, and a
drum set costing $600. The mean cost for a piano is $4,000 with a
standard deviation of $2,500. The mean cost for a guitar is $500 with a
standard deviation of $200. The mean cost for drums is $700 with a
standard deviation of $100. Which cost is the lowest, when compared
to other instruments of the same type? Which cost is the highest when
compared to other instruments of the same type? Justify your answer
numerically.


Solution: 


For pianos, the cost of the piano is 0.4 standard deviations BELOW
the mean. For guitars, the cost of the guitar is 0.25 standard deviations
ABOVE the mean. For drums, the cost of the drum set is 1.0 standard
deviations BELOW the mean. Of the three, the drum set has the lowest
cost in comparison to other instruments of the same type. The
guitar has the highest cost in comparison to other instruments of
the same type.
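
The comparison above rests on the number of standard deviations each price lies from its own mean, (value - mean) / standard deviation. A minimal Python sketch of that arithmetic follows; the function name z_score and the dictionary layout are illustrative choices, not notation from the text.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

# (planned cost, mean cost, standard deviation) for each instrument type
instruments = {
    "piano": (3000, 4000, 2500),
    "guitar": (550, 500, 200),
    "drums": (600, 700, 100),
}

z_scores = {name: z_score(*figures) for name, figures in instruments.items()}

# The most negative z-score is the lowest relative cost; the largest is the highest.
lowest = min(z_scores, key=z_scores.get)
highest = max(z_scores, key=z_scores.get)

for name, z in z_scores.items():
    print(f"{name}: {z:+.2f} standard deviations from the mean")
print("lowest relative cost:", lowest)
print("highest relative cost:", highest)
```

Dividing by the standard deviation is what makes prices of different instrument types comparable, since each difference is measured in units of its own group's spread.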


Exercise: 
Problem: 
Suppose that a publisher conducted a survey asking adult consumers
the number of fiction paperback books they had purchased in the
previous month. The results are summarized in the table below. (Note
that this is the data presented for publisher B in homework exercise
13.)

# of books    Freq.    Rel. Freq.
0             18
1             24
2             24
3             22
4             15
5             10
7             5
9             1
Publisher B

a. Are there any outliers in the data? Use an appropriate numerical
test involving the IQR to identify outliers, if any, and clearly state
your conclusion.

b. If a data value is identified as an outlier, what should be done
about it?

c. Are any data values further than 2 standard deviations away from
the mean? In some situations, statisticians may use this criterion to
identify data values that are unusual, compared to the other data
values. (Note that this criterion is most appropriate to use for data
that is mound-shaped and symmetric, rather than for skewed
data.)

d. Do parts (a) and (c) of this problem give the same answer?

e. Examine the shape of the data. Which part, (a) or (c), of this
question gives a more appropriate result for this data?

f. Based on the shape of the data, which is the most appropriate
measure of center for this data: mean, median, or mode?


Solution: 


a. IQR = 4 - 1 = 3; Q1 - 1.5*IQR = 1 - 1.5(3) = -3.5; Q3 +
1.5*IQR = 4 + 1.5(3) = 8.5. The data value of 9 is larger than 8.5.
The purchase of 9 books in one month is an outlier.

b. The outlier should be investigated to see if there is an error or
some other problem in the data; then a decision whether to
include or exclude it should be made based on the particular
situation. If it was a correct value, then the data value should
remain in the data set. If there is a problem with this data value,
then it should be corrected or removed from the data. For
example: If the data was recorded incorrectly (perhaps a 9 was
miscoded and the correct value was 6), then the data should be
corrected. If it was an error but the correct value is not known, it
should be removed from the data set.

c. xbar - 2s = 2.45 - 2*1.88 = -1.31; xbar + 2s = 2.45 + 2*1.88 =
6.21. Using this method, the five data values of 7 books
purchased and the one data value of 9 books purchased would be
considered unusual.

d. No: part (a) identifies only the value of 9 to be an outlier, but part
(c) identifies both 7 and 9.

e. The data is skewed (to the right). It would be more appropriate to
use the method involving the IQR in part (a), identifying only the
one value of 9 books purchased as an outlier. Note that part (c)
remarks that identifying unusual data values by using the criterion
of being further than 2 standard deviations away from the mean is
most appropriate when the data are mound-shaped and
symmetric.

f. The data are skewed to the right. For skewed data it is more
appropriate to use the median as a measure of center.
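
The two screening rules used in this solution can be checked with a short Python sketch built on the standard statistics module. It is offered only as an illustration: the choice of method="exclusive" for the quartiles is an assumption that happens to reproduce Q1 = 1 and Q3 = 4 for this data set, and other quartile conventions can give different values for other data.

```python
import statistics

# Publisher B frequency table: number of books -> frequency
freq = {0: 18, 1: 24, 2: 24, 3: 22, 4: 15, 5: 10, 7: 5, 9: 1}

# Expand the table into the full sorted list of 119 responses.
data = [x for x, f in sorted(freq.items()) for _ in range(f)]

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, _median, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = sorted({x for x in data if x < lower_fence or x > upper_fence})

# Two-standard-deviation rule, using the sample standard deviation.
mean = statistics.mean(data)
s = statistics.stdev(data)
unusual = sorted({x for x in data if abs(x - mean) > 2 * s})

print("fences:", lower_fence, upper_fence, "outliers:", iqr_outliers)
print(f"mean = {mean:.2f}, s = {s:.2f}, unusual values: {unusual}")
```

The two rules disagree here because the data are skewed to the right, which is the point made in parts (d) and (e).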


**Exercises 32 and 33 contributed by Roberta Bloom


Descriptive Statistics: Descriptive Statistics Lab (edited: Teegarden) 
Labs changed to incorporate Minitab.


Descriptive Statistics Lab 


Name: 


A. Student Learning Objectives 
• The student will construct a histogram and a box plot.
• The student will calculate univariate statistics.
• The student will examine the graphs to interpret what the data implies.


B. Collect the Data 
Record the number of pairs of shoes you own: 


1. Randomly survey 20 people. Record their values. Survey Results 


2. Construct a histogram using Minitab. Choose an appropriate scale and 
use boundary values (cut points). 


3. Calculate the following: Be sure to include the formulas and the 
appropriate values. Show your work 


a. x̄ =
b. s =


4. Are the data discrete or continuous? How do you know? Use complete 
sentences. 


5. Describe the shape of the histogram. Use 2 — 3 complete sentences. 


C. Analyze the Data 

1. Determine the following and show your work where appropriate: 
• Minimum value =

• Median =

• Maximum value =

• First quartile =

• Third quartile =

• IQR =

2. Using Minitab, construct a box plot of data. 


3. What does the shape of the box plot imply about the concentration of 
data? Use 2 — 3 complete sentences. 


4. What does the IQR represent in this problem? (reference your values) 
5. Are there any potential outliers? Which value(s) is (are) it (they)? 


Use the formula to calculate the two end values used to determine if a data 
value is an outlier. 


upper = 
lower = 

6. Show your work to find the value that is 1.5 standard deviations: 
a. Above the mean: 


b. Below the mean: 


c. What percent of the data does Chebyshev’s theorem state lies within 1.5 
standard deviations of the mean? (show your work.) 


d. What percentage of your data actually falls within 1.5 standard deviations 
of the mean? How does this compare to the value you calculated in part c 
above? 


7. How does the standard deviation help you to determine concentration of 
the data and whether or not there are potential outliers? 


Probability Topics 
This module introduces the concept of Probability, the chance of an event 
occurring. 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Understand and use the terminology of probability.

• Determine whether two events are mutually exclusive and whether two
events are independent.

• Calculate probabilities using the Addition Rules and Multiplication
Rules.

• Construct and interpret Contingency Tables.

• Construct and interpret Venn Diagrams (optional).

• Construct and interpret Tree Diagrams (optional).


Introduction 


It is often necessary to "guess" about the outcome of an event in order to 
make a decision. Politicians study polls to guess their likelihood of winning 
an election. Teachers choose a particular course of study based on what they 
think students can comprehend. Doctors choose the treatments needed for 
various diseases based on their assessment of likely results. You may have 
visited a casino where people play games chosen because of the belief that 
the likelihood of winning is good. You may have chosen your course of 
study based on the probable availability of jobs. 


You have, more than likely, used probability. In fact, you probably have an 
intuitive sense of probability. Probability deals with the chance of an event 
occurring. Whenever you weigh the odds of whether or not to do your 
homework or to study for an exam, you are using probability. In this 
chapter, you will learn to solve probability problems using a systematic 
approach. 


Optional Collaborative Classroom Exercise 


Your instructor will survey your class. Count the number of students in the 
class today. 


• Raise your hand if you have any change in your pocket or purse.
Record the number of raised hands.

• Raise your hand if you rode a bus within the past month. Record the
number of raised hands.

• Raise your hand if you answered "yes" to BOTH of the first two
questions. Record the number of raised hands.


Use the class data as estimates of the following probabilities. P(change) 
means the probability that a randomly chosen person in your class has 
change in his/her pocket or purse. P(bus) means the probability that a 
randomly chosen person in your class rode a bus within the last month and 
so on. Discuss your answers. 


• Find P(change).

• Find P(bus).

• Find P(change and bus). Find the probability that a randomly chosen
student in your class has change in his/her pocket or purse and rode a
bus within the last month.

• Find P(change | bus). Find the probability that a randomly chosen
student has change given that he/she rode a bus within the last month.
Count all the students that rode a bus. From the group of students who
rode a bus, count those who have change. The probability is equal to
the number who have change and rode a bus divided by the number who
rode a bus.


Terminology 

Probability: Terminology, part of the collection col10555 written by
Barbara Illowsky and Susan Dean, defines key terms related to Probability
and has contributions from Roberta Bloom.


Probability is a measure that is associated with how certain we are of 
outcomes of a particular experiment or activity. An experiment is a 
planned operation carried out under controlled conditions. If the result is 
not predetermined, then the experiment is said to be a chance experiment. 
Flipping one fair coin twice is an example of an experiment. 


The result of an experiment is called an outcome. A sample space is a set 
of all possible outcomes. Three ways to represent a sample space are to list 
the possible outcomes, to create a tree diagram, or to create a Venn diagram. 
The uppercase letter S is used to denote the sample space. For example, if
you flip one fair coin, S = {H, T} where H = heads and T = tails are the 
outcomes. 


An event is any combination of outcomes. Upper case letters like A and B 
represent events. For example, if the experiment is to flip one fair coin, 
event A might be getting at most one head. The probability of an event A is 
written P(A). 


The probability of any outcome is the long-term relative frequency of 
that outcome. Probabilities are between 0 and 1, inclusive (includes 0 and 
1 and all numbers between these values). P(A) = 0 means the event A can 
never happen. P(A) = 1 means the event A always happens. P(A) = 0.5 
means the event A is equally likely to occur or not to occur. For example, if 
you flip one fair coin repeatedly (from 20 to 2,000 to 20,000 times) the 
relative frequency of heads approaches 0.5 (the probability of heads).


Equally likely means that each outcome of an experiment occurs with 
equal probability. For example, if you toss a fair, six-sided die, each face 
(1, 2, 3, 4, 5, or 6) is as likely to occur as any other face. If you toss a fair 
coin, a Head(H) and a Tail(T) are equally likely to occur. If you randomly 
guess the answer to a true/false question on an exam, you are equally likely 
to select a correct answer or an incorrect answer. 


To calculate the probability of an event A when all outcomes in the
sample space are equally likely, count the number of outcomes for event
A and divide by the total number of outcomes in the sample space. For
example, if you toss a fair dime and a fair nickel, the sample space is
{HH, TH, HT, TT} where T = tails and H = heads. The sample space
has four outcomes. A = getting one head. There are two outcomes
{HT, TH}. P(A) = 2/4.

Suppose you roll one fair six-sided die, with the numbers {1,2,3,4,5,6} on
its faces. Let event E = rolling a number that is at least 5. There are two
outcomes {5, 6}. P(E) = 2/6. If you were to roll the die only a few times,
you would not be surprised if your observed results did not match the
probability. If you were to roll the die a very large number of times, you 
would expect that, overall, 2/6 of the rolls would result in an outcome of "at 
least 5". You would not expect exactly 2/6. The long-term relative 
frequency of obtaining this result would approach the theoretical probability 
of 2/6 as the number of repetitions grows larger and larger. 
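
The counting rule for equally likely outcomes can be illustrated with a short Python sketch. The variable names are illustrative, and Fraction is used only so that the results stay as exact fractions; note that Python reduces them automatically, so 2/4 is displayed as 1/2 and 2/6 as 1/3.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing a fair dime and a fair nickel: HH, HT, TH, TT.
sample_space = ["".join(pair) for pair in product("HT", repeat=2)]

# Event A: getting one head, that is, exactly one H among the two coins.
event_a = [outcome for outcome in sample_space if outcome.count("H") == 1]
p_a = Fraction(len(event_a), len(sample_space))

# Event E: rolling a number that is at least 5 on a fair six-sided die.
die = range(1, 7)
event_e = [face for face in die if face >= 5]
p_e = Fraction(len(event_e), len(die))

print("P(A) =", p_a)
print("P(E) =", p_e)
```

This approach is valid only because every outcome in each sample space is equally likely; it does not apply to a biased coin or die.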


This important characteristic of probability experiments is known as the
Law of Large Numbers: as the number of repetitions of an experiment is 
increased, the relative frequency obtained in the experiment tends to 
become closer and closer to the theoretical probability. Even though the 
outcomes don't happen according to any set pattern or order, overall, the 
long-term observed relative frequency will approach the theoretical 
probability. (The word empirical is often used instead of the word 
observed.) The Law of Large Numbers will be discussed again in Chapter 7. 
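
The Law of Large Numbers can be seen in a small simulation. The Python sketch below is an illustration under stated assumptions: the die is simulated with a pseudorandom number generator, the seed is arbitrary and fixed only so that a run is repeatable, and the three sample sizes are chosen for display rather than taken from the text.

```python
import random

def relative_frequency_at_least_5(num_rolls, rng):
    """Roll a simulated fair die num_rolls times; return the share of rolls that are 5 or 6."""
    hits = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) >= 5)
    return hits / num_rolls

rng = random.Random(7)  # arbitrary seed, assumed only for repeatability

# Relative frequency of "at least 5" for increasingly long runs of rolls.
results = {n: relative_frequency_at_least_5(n, rng) for n in (20, 2_000, 200_000)}

for n, freq in results.items():
    print(f"{n:>7} rolls: relative frequency = {freq:.4f} (theoretical 2/6 = {2/6:.4f})")
```

Short runs may land noticeably away from 2/6, while the longest run should sit very close to it, which is the behavior the law describes.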


It is important to realize that in many situations, the outcomes are not 
equally likely. A coin or die may be unfair, or biased. Two math
professors in Europe had their statistics students test the Belgian 1 Euro 
coin and discovered that in 250 trials, a head was obtained 56% of the time 
and a tail was obtained 44% of the time. The data seem to show that the 
coin is not a fair coin; more repetitions would be helpful to draw a more 
accurate conclusion about such bias. Some dice may be biased. Look at the 
dice in a game you have at home; the spots on each face are usually small 
holes carved out and then painted to make the spots visible. Your dice may 
or may not be biased; it is possible that the outcomes may be affected by the
slight weight differences due to the different numbers of holes in the faces.
Gambling casinos have a lot of money depending on outcomes from rolling 
dice, so casino dice are made differently to eliminate bias. Casino dice have 
flat faces; the holes are completely filled with paint having the same density 
as the material that the dice are made out of so that each face is equally 
likely to occur. Later in this chapter we will learn techniques to use to work 
with probabilities for events that are not equally likely. 


"OR" Event: 

An outcome is in the event A OR B if the outcome is in A or is in B or is
in both A and B. For example, let A = {1, 2, 3, 4, 5} and
B = {4, 5, 6, 7, 8}. A OR B = {1, 2, 3, 4, 5, 6, 7, 8}. Notice that 4
and 5 are NOT listed twice.


"AND" Event: 

An outcome is in the event A AND B if the outcome is in both A and B at
the same time. For example, let A and B be {1, 2, 3, 4, 5} and
{4, 5, 6, 7, 8}, respectively. Then A AND B = {4, 5}.
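
The "OR" and "AND" events correspond to set union and set intersection. As a brief illustration, Python's built-in set type expresses both directly; the variable names are illustrative only.

```python
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}

# A OR B: outcomes in A, in B, or in both (set union). Duplicates are not repeated.
a_or_b = A | B

# A AND B: outcomes in both A and B at the same time (set intersection).
a_and_b = A & B

print("A OR B  =", sorted(a_or_b))
print("A AND B =", sorted(a_and_b))
```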


The complement of event A is denoted A’ (read "A prime"). A’ consists of
all outcomes that are NOT in A. Notice that P(A) + P(A’) = 1. For
example, let S = {1, 2, 3, 4, 5, 6} and let A = {1, 2, 3, 4}. Then,
A’ = {5, 6}. P(A) = 4/6, P(A’) = 2/6, and P(A) + P(A’) = 4/6 + 2/6 = 1.

The conditional probability of A given B is written P(A|B). P(A|B) is 
the probability that event A will occur given that the event B has already 
occurred. A conditional reduces the sample space. We calculate the 
probability of A from the reduced sample space B. The formula to calculate 
P(A|B) is 


P(A|B) = P(A AND B) / P(B)


where P(B) is greater than 0. 
For example, suppose we toss one fair, six-sided die. The sample space
S = {1, 2, 3, 4, 5, 6}. Let A = face is 2 or 3 and B = face is even (2, 4, 6).
To calculate P(A|B), we count the number of outcomes 2 or 3 in the
sample space B = {2, 4, 6}. Then we divide that by the number of
outcomes in B (and not S). Only the outcome 2 is in B, and B has 3
outcomes, so P(A|B) = 1/3.


We get the same result by using the formula. Remember that S has 6 
outcomes. 


P(A|B) = P(A AND B) / P(B)
= [(the number of outcomes that are 2 or 3 and even in S) / 6] / [(the number of outcomes that are even in S) / 6]
= (1/6) / (3/6) = 1/3
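The two routes to P(A|B) described above (counting in the reduced sample space, and applying the formula) can be compared directly. This is a small illustrative sketch for the fair-die example; the variable names are ours.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for one fair die
A = {2, 3}               # face is 2 or 3
B = {2, 4, 6}            # face is even

# Method 1: count inside the reduced sample space B.
by_counting = Fraction(len(A & B), len(B))

# Method 2: use the formula P(A|B) = P(A AND B) / P(B).
p_a_and_b = Fraction(len(A & B), len(S))
p_b = Fraction(len(B), len(S))
by_formula = p_a_and_b / p_b

print(by_counting, by_formula)
```

Both methods divide by the size of B in the end; the formula simply carries the common factor of 6 along until it cancels.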


Understanding Terminology and Symbols 

It is important to read each problem carefully to think about and understand 
what the events are. Understanding the wording is the first very important 
step in solving probability problems. Reread the problem several times if 
necessary. Clearly identify the event of interest. Determine whether there is 
a condition stated in the wording that would indicate that the probability is 
conditional; carefully identify the condition, if any. 

Exercise: 


Problem: 


In a particular college class, there are male and female students. Some 
students have long hair and some students have short hair. Write the 
symbols for the probabilities of the events for parts (a) through (j) 
below. (Note that you can't find numerical answers here. You were not 
given enough information to find any probability values yet; 
concentrate on understanding the symbols.) 


• Let F be the event that a student is female.
• Let M be the event that a student is male.
• Let S be the event that a student has short hair.
• Let L be the event that a student has long hair.

a. The probability that a student does not have long hair.
b. The probability that a student is male or has short hair.
c. The probability that a student is a female and has long hair.
d. The probability that a student is male, given that the student has
long hair.
e. The probability that a student has long hair, given that the
student is male.
f. Of all the female students, the probability that a student has
short hair.
g. Of all students with long hair, the probability that a student is
female.
h. The probability that a student is female or has long hair.
i. The probability that a randomly selected student is a male
student with short hair.
j. The probability that a student is female.


Solution: 


a. P(L') = P(S)
b. P(M OR S)
c. P(F AND L)
d. P(M|L)
e. P(L|M)
f. P(S|F)
g. P(F|L)
h. P(F OR L)
i. P(M AND S)
j. P(F)


**With contributions from Roberta Bloom 


Glossary 


Conditional Probability 
The likelihood that an event will occur given that another event has 
already occurred. 


Equally Likely 
Each outcome of an experiment has the same probability. 


Experiment 
A planned activity carried out under controlled conditions. 


Event 
A subset in the set of all outcomes of an experiment. The set of all 
outcomes of an experiment is called a sample space and denoted 
usually by S. An event is any arbitrary subset in S. It can contain one 
outcome, two outcomes, no outcomes (empty subset), the entire 
sample space, etc. Standard notations for events are capital letters such 
as A, B, C, etc.


Outcome (observation) 
A particular result of an experiment. 


Probability 
A number between 0 and 1, inclusive, that gives the likelihood that a
specific event will occur. The foundation of statistics is given by the
following 3 axioms (by A. N. Kolmogorov, 1930s): Let S denote the
sample space and let A and B be two events in S. Then:

• 0 ≤ P(A) ≤ 1;
• If A and B are any two mutually exclusive events, then
P(A OR B) = P(A) + P(B);
• P(S) = 1.


Sample Space 
The set of all possible outcomes of an experiment. 


Independent and Mutually Exclusive Events 

Probability: Independent and Mutually Exclusive Events is part of the 
collection col10555 written by Barbara Illowsky and Susan Dean and 
explains the concept of independent events, where the probability of event 
A does not have any effect on the probability of event B, and mutually 
exclusive events, where events A and B cannot occur at the same time. The 
module has contributions from Roberta Bloom. 


Independent and mutually exclusive do not mean the same thing. 


Independent Events 


Two events are independent if the following are true:


• P(A|B) = P(A)
• P(B|A) = P(B)
• P(A AND B) = P(A) · P(B)


Two events A and B are independent if the knowledge that one occurred 
does not affect the chance the other occurs. For example, the outcomes of 
two rolls of a fair die are independent events. The outcome of the first roll
does not change the probability for the outcome of the second roll. To show 
two events are independent, you must show only one of the above 
conditions. If two events are NOT independent, then we say that they are 
dependent. 


Sampling may be done with replacement or without replacement. 


• With replacement: If each member of a population is replaced after it
is picked, then that member has the possibility of being chosen more
than once. When sampling is done with replacement, then events are
considered to be independent, meaning the result of the first pick will
not change the probabilities for the second pick.
• Without replacement: When sampling is done without replacement,
then each member of a population may be chosen only once. In this
case, the probabilities for the second pick are affected by the result of
the first pick. The events are considered to be dependent or not
independent.


If it is not known whether A and B are independent or dependent, assume 
they are dependent until you can show otherwise. 


Mutually Exclusive Events 


A and B are mutually exclusive events if they cannot occur at the same 
time. This means that A and B do not share any outcomes and 
P(A AND B) = 0. 


For example, suppose the sample space S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. 
Let A = {1, 2, 3, 4, 5}, B = {4, 5, 6, 7, 8}, and C = {7, 9}.

A AND B = {4, 5}. P(A AND B) = 2/10 and is not equal to zero.
Therefore, A and B are not mutually exclusive. A and C do not have any
numbers in common so P(A AND C) = 0. Therefore, A and C are
mutually exclusive.


If it is not known whether A and B are mutually exclusive, assume they 
are not until you can show otherwise. 


The following examples illustrate these definitions and terms. 


Example: 

Flip two fair coins. (This is an experiment.) 

The sample space is {HH, HT, TH, TT} where T = tails and H = heads. 
The outcomes are HH, HT, TH, and TT. The outcomes HT and TH are 
different. The HT means that the first coin showed heads and the second 
coin showed tails. The TH means that the first coin showed tails and the
second coin showed heads. 


• Let A = the event of getting at most one tail. (At most one tail means
0 or 1 tail.) Then A can be written as {HH, HT, TH}. The outcome
HH shows 0 tails. HT and TH each show 1 tail.
• Let B = the event of getting all tails. B can be written as {TT}. B is
the complement of A. So, B = A’. Also,
P(A) + P(B) = P(A) + P(A’) = 1.
• The probabilities for A and for B are P(A) = 3/4 and P(B) = 1/4.
• Let C = the event of getting all heads. C = {HH}. Since B = {TT},
P(B AND C) = 0. B and C are mutually exclusive. (B and C have
no members in common because you cannot have all tails and all
heads at the same time.)
• Let D = event of getting more than one tail. D = {TT}. P(D) = 1/4.
• Let E = event of getting a head on the first flip. (This implies you can
get either a head or tail on the second flip.) E = {HT, HH}.
P(E) = 2/4.
• Find the probability of getting at least one (1 or 2) tail in two flips.
Let F = event of getting at least one tail in two flips.
F = {HT, TH, TT}. P(F) = 3/4.
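The events in the two-coin example can be checked by listing the sample space and filtering it. Below is a minimal sketch, assuming fair coins so that all four outcomes are equally likely; the helper `prob` is ours and each outcome is stored as a two-letter string such as "HT".

```python
from fractions import Fraction
from itertools import product

# Sample space for two flips: HH, HT, TH, TT.
S = {"".join(flips) for flips in product("HT", repeat=2)}


def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))


A = {o for o in S if o.count("T") <= 1}   # at most one tail
B = {o for o in S if o.count("T") == 2}   # all tails
C = {o for o in S if o.count("H") == 2}   # all heads
E = {o for o in S if o[0] == "H"}         # head on the first flip
F = {o for o in S if o.count("T") >= 1}   # at least one tail

print(prob(A), prob(B), prob(E), prob(F))
print(B == S - A)        # True when B is the complement of A
print(len(B & C) == 0)   # True when B and C are mutually exclusive
```

Note that `Fraction` reduces automatically, so P(E) = 2/4 is displayed as 1/2.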


Example: 

Roll one fair 6-sided die. The sample space is {1, 2, 3, 4, 5, 6}. Let event
A = a face is odd. Then A = {1, 3, 5}. Let event B = a face is even. Then
B = {2, 4, 6}.


• Find the complement of A, A’. The complement of A, A’, is B
because A and B together make up the sample space.
P(A) + P(B) = P(A) + P(A’) = 1. Also, P(A) = 3/6 and
P(B) = 3/6.
• Let event C = odd faces larger than 2. Then C = {3, 5}. Let event D
= all even faces smaller than 5. Then D = {2, 4}. P(C AND D) = 0
because you cannot have an odd and even face at the same time.
Therefore, C and D are mutually exclusive events.
• Let event E = all faces less than 5. E = {1, 2, 3, 4}.
Exercise: 


Problem: 


Are C and E mutually exclusive events? (Answer yes or no.)
Why or why not? 


Solution: 


No. C = {3, 5} and E = {1, 2, 3, 4}. P(C AND E) = 1/6. To be
mutually exclusive, P(C AND E) must be 0.


• Find P(C|A). This is a conditional. Recall that the event C is {3, 5}
and event A is {1, 3, 5}. To find P(C|A), find the probability of C
using the sample space A. You have reduced the sample space from
the original sample space {1, 2, 3, 4, 5, 6} to {1, 3, 5}. So,
P(C|A) = 2/3.


Example: 

Let event G = taking a math class. Let event H = taking a science class. 
Then, G AND H = taking a math class and a science class. Suppose 
P(G) = 0.6, P(H) = 0.5, and P(G AND H) = 0.3. Are G and H 
independent? 

To show that G and H are independent, you must show ONE of the following:


• P(G|H) = P(G)
• P(H|G) = P(H)
• P(G AND H) = P(G) · P(H)


Note:The choice you make depends on the information you have. You 
could choose any of the methods here because you have the necessary 
information. 


Exercise: 


Problem: Show that P(G|H) = P(G). 


Solution: 


P(G|H) = P(G AND H) / P(H) = 0.3 / 0.5 = 0.6 = P(G)


Exercise: 
Problem: Show P(G AND H) = P(G) · P(H).


Solution: 


P(G) · P(H) = (0.6)(0.5) = 0.3 = P(G AND H)


Since G and H are independent, then, knowing that a person is taking a 
science class does not change the chance that he/she is taking math. If the 
two events had not been independent (that is, they are dependent) then 
knowing that a person is taking a science class would change the chance 
he/she is taking math. For practice, show that P(H|G) = P(H) to show 
that G and H are independent events. 
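The independence checks for G and H, including the practice check P(H|G) = P(H), can be verified with exact arithmetic. The sketch below stores the given values P(G) = 0.6, P(H) = 0.5, and P(G AND H) = 0.3 as fractions to avoid floating-point rounding; the variable names are ours.

```python
from fractions import Fraction

# Given values from the example, stored as exact fractions.
p_g = Fraction(6, 10)         # P(G)
p_h = Fraction(5, 10)         # P(H)
p_g_and_h = Fraction(3, 10)   # P(G AND H)

p_g_given_h = p_g_and_h / p_h   # P(G|H) = P(G AND H) / P(H)
p_h_given_g = p_g_and_h / p_g   # P(H|G) = P(G AND H) / P(G)

checks = {
    "P(G|H) = P(G)": p_g_given_h == p_g,
    "P(H|G) = P(H)": p_h_given_g == p_h,
    "P(G AND H) = P(G)P(H)": p_g_and_h == p_g * p_h,
}
for label, holds in checks.items():
    print(label, holds)
```

All three conditions hold or fail together, which is why showing any one of them is enough.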


Example: 

In a box there are 3 red cards and 5 blue cards. The red cards are marked 
with the numbers 1, 2, and 3, and the blue cards are marked with the 
numbers 1, 2, 3, 4, and 5. The cards are well-shuffled. You reach into the 
box (you cannot see into it) and draw one card. 

Let R = red card is drawn, B = blue card is drawn, E = even-numbered
card is drawn. 

The sample space S = R1, R2, R3, B1, B2, B3, B4, B5. S has 8 
outcomes. 


• P(R) = 3/8. P(B) = 5/8. P(R AND B) = 0. (You cannot draw one
card that is both red and blue.)
• P(E) = 3/8. (There are 3 even-numbered cards, R2, B2, and B4.)
• P(E|B) = 2/5. (There are 5 blue cards: B1, B2, B3, B4, and B5. Out
of the blue cards, there are 2 even cards: B2 and B4.)
• P(B|E) = 2/3. (There are 3 even-numbered cards: R2, B2, and B4.
Out of the even-numbered cards, 2 are blue: B2 and B4.)
• The events R and B are mutually exclusive because
P(R AND B) = 0.
• Let G = card with a number greater than 3. G = {B4, B5}.
P(G) = 2/8. Let H = blue card numbered between 1 and 4, inclusive.
H = {B1, B2, B3, B4}. P(G|H) = 1/4. (The only card in H that has
a number greater than 3 is B4.) Since 2/8 = 1/4, P(G) = P(G|H),
which means that G and H are independent.
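The card probabilities above can be reproduced by counting cards. This is a small sketch under the stated setup (3 red cards numbered 1 to 3, 5 blue cards numbered 1 to 5, one card drawn at random); the helper `prob` and the predicate names are ours, not part of the text.

```python
from fractions import Fraction

# Each card is a (color, number) pair.
S = [("R", n) for n in (1, 2, 3)] + [("B", n) for n in (1, 2, 3, 4, 5)]


def prob(event, given=None):
    """P(event), or P(event | given) when `given` is supplied, by counting cards."""
    space = S if given is None else [c for c in S if given(c)]
    return Fraction(sum(1 for c in space if event(c)), len(space))


def is_blue(card):
    return card[0] == "B"


def is_even(card):
    return card[1] % 2 == 0


def in_g(card):
    return card[1] > 3                       # number greater than 3


def in_h(card):
    return card[0] == "B" and card[1] <= 4   # blue card numbered 1 to 4


print(prob(is_even), prob(is_even, is_blue), prob(is_blue, is_even))
print(prob(in_g), prob(in_g, in_h))
```

Passing a `given` predicate shrinks the list of cards before counting, which is the "reduced sample space" idea in code form.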


Example: 

In a particular college class, 60% of the students are female. 50% of all
students in the class have long hair. 45% of the students are female and 
have long hair. Of the female students, 75% have long hair. Let F be the 
event that the student is female. Let L be the event that the student has long 
hair. One student is picked randomly. Are the events of being female and 
having long hair independent? 


• The following probabilities are given in this example:
• P(F) = 0.60; P(L) = 0.50
• P(F AND L) = 0.45
• P(L|F) = 0.75


Note: The choice you make depends on the information you have. You
could use the first or last of the three independence conditions for this
example. You do not know P(F|L) yet, so you cannot use the second condition.


Solution 1 

Check whether P(F AND L) = P(F)P(L): We are given that P(F AND L) = 0.45,
but P(F)P(L) = (0.60)(0.50) = 0.30. The events of being female and having
long hair are not independent because P(F AND L) does not equal P(F)P(L).

Solution 2

Check whether P(L|F) equals P(L): We are given that P(L|F) = 0.75 but
P(L) = 0.50; they are not equal. The events of being female and having
long hair are not independent.

Interpretation of Results 

The events of being female and having long hair are not independent; 
knowing that a student is female changes the probability that a student has 
long hair. 


**Example 5 contributed by Roberta Bloom 


Glossary 


Independent Events 
The occurrence of one event has no effect on the probability of the 
occurrence of any other event. Events A and B are independent if one 
of the following is true: (1) P(A|B) = P(A); (2) P(B|A) = P(B);
(3) P(A AND B) = P(A)P(B).


Mutually Exclusive 
An observation cannot fall into more than one class (category). Being 
in more than one category prevents being in a mutually exclusive 
category. 


Two Basic Rules of Probability 
This module introduces the multiplication and addition rules used when calculating 
probabilities. 


The Multiplication Rule 


If A and B are two events defined on a sample space, then:
P(A AND B) = P(B) · P(A|B).


This rule may also be written as: P(A|B) = P(A AND B) / P(B).
(The probability of A given B equals the probability of A and B divided by the
probability of B.)


If A and B are independent, then P(A|B) = P(A). Then 
P(A AND B) = P(A|B) P(B) becomes P(A AND B) = P(A) P(B). 


The Addition Rule 


If A and B are defined on a sample space, then:
P(A OR B) = P(A) + P(B) - P(A AND B).


If A and B are mutually exclusive, then P(A AND B) = 0. Then
P(A OR B) = P(A) + P(B) - P(A AND B) becomes
P(A OR B) = P(A) + P(B).


Example: 
Klaus is trying to choose where to go on vacation. His two choices are: A = New 
Zealand and B = Alaska 


• Klaus can only afford one vacation. The probability that he chooses A is
P(A) = 0.6 and the probability that he chooses B is P(B) = 0.35.
• P(A AND B) = 0 because Klaus can only afford to take one vacation.
• Therefore, the probability that he chooses either New Zealand or Alaska is
P(A OR B) = P(A) + P(B) = 0.6 + 0.35 = 0.95. Note that the probability
that he does not choose to go anywhere on vacation must be 0.05.


Example: 


Carlos plays college soccer. He makes a goal 65% of the time he shoots. Carlos is going 
to attempt two goals in a row in the next game. 

A = the event Carlos is successful on his first attempt. P(A) = 0.65. B = the event 
Carlos is successful on his second attempt. P(B) = 0.65. Carlos tends to shoot in 
streaks. The probability that he makes the second goal GIVEN that he made the first 
goal is 0.90. 

Exercise: 


Problem: What is the probability that he makes both goals? 


Solution: 


The problem is asking you to find P(A AND B) = P(B AND A). Since
P(B|A) = 0.90:

P(B AND A) = P(B|A) P(A) = (0.90)(0.65) = 0.585

Carlos makes the first and second goals with probability 0.585.
Exercise: 


Problem: 
What is the probability that Carlos makes either the first goal or the second goal? 
Solution: 


The problem is asking you to find P(A OR B).

P(A OR B) = P(A) + P(B) - P(A AND B) = 0.65 + 0.65 - 0.585 = 0.715

Carlos makes either the first goal or the second goal with probability 0.715.
Exercise: 


Problem: Are A and B independent? 


Solution: 


No, they are not, because P(B AND A) = 0.585, while

P(B) · P(A) = (0.65)(0.65) = 0.423

0.423 ≠ 0.585 = P(B AND A)

So, P(B AND A) is not equal to P(B) · P(A).


Exercise: 


Problem: Are A and B mutually exclusive? 


Solution: 
No, they are not because P(A and B) = 0.585. 


To be mutually exclusive, P(A AND B) must equal 0. 
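The four answers in the Carlos example follow from the multiplication and addition rules. Here is a brief sketch that recomputes them with exact fractions; the variable names are ours, and the inputs are the values given in the example (P(A) = P(B) = 0.65 and P(B|A) = 0.90).

```python
from fractions import Fraction

p_a = Fraction(65, 100)           # P(A): makes the first goal
p_b = Fraction(65, 100)           # P(B): makes the second goal
p_b_given_a = Fraction(90, 100)   # P(B|A): second goal given the first

p_a_and_b = p_b_given_a * p_a       # multiplication rule
p_a_or_b = p_a + p_b - p_a_and_b    # addition rule

independent = p_a_and_b == p_a * p_b   # would require 0.585 = 0.4225
mutually_exclusive = p_a_and_b == 0    # would require P(A AND B) = 0

print(float(p_a_and_b), float(p_a_or_b), float(p_a * p_b))
print(independent, mutually_exclusive)
```

The exact product P(A)P(B) is 0.4225, which the solution rounds to 0.423.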


Example: 

A community swim team has 150 members. Seventy-five of the members are advanced 
swimmers. Forty-seven of the members are intermediate swimmers. The remainder are 
novice swimmers. Forty of the advanced swimmers practice 4 times a week. Thirty of 
the intermediate swimmers practice 4 times a week. Ten of the novice swimmers 
practice 4 times a week. Suppose one member of the swim team is randomly chosen. 
Answer the questions (Verify the answers): 

Exercise: 


Problem: What is the probability that the member is a novice swimmer? 


Solution: 


28/150


Exercise: 


Problem: What is the probability that the member practices 4 times a week? 


Solution: 


80/150


Exercise: 
Problem: 


What is the probability that the member is an advanced swimmer and practices 4 
times a week? 


Solution: 


40/150


Exercise: 
Problem: 
What is the probability that a member is an advanced swimmer and an
intermediate swimmer? Are being an advanced swimmer and an intermediate
swimmer mutually exclusive? Why or why not?


Solution: 


P(advanced AND intermediate) = 0, so these are mutually exclusive events. 
A swimmer cannot be an advanced swimmer and an intermediate swimmer at the 
same time. 


Exercise: 


Problem: 


Are being a novice swimmer and practicing 4 times a week independent events? 
Why or why not? 


Solution: 


No, these are not independent events. 


P(novice AND practices 4 times per week) = 10/150 = 0.0667

P(novice) · P(practices 4 times per week) = (28/150)(80/150) = 0.0996

0.0667 ≠ 0.0996
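To verify the swim team answers, the counts from the problem statement can be tabulated and the independence test applied. This is an illustrative sketch; the dictionary layout and names are ours.

```python
from fractions import Fraction

total = 150
members = {"advanced": 75, "intermediate": 47}
members["novice"] = total - sum(members.values())   # the remainder

# Members in each level who practice 4 times a week.
practice4 = {"advanced": 40, "intermediate": 30, "novice": 10}

p_novice = Fraction(members["novice"], total)
p_practice4 = Fraction(sum(practice4.values()), total)
p_novice_and_practice4 = Fraction(practice4["novice"], total)

product = p_novice * p_practice4
print(round(float(p_novice_and_practice4), 4), round(float(product), 4))
print(p_novice_and_practice4 == product)   # equality would mean independence
```

The novice count is derived rather than typed in, so the sketch also confirms the 28/150 answer to the first question.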


Example: 

Studies show that, if she lives to be 90, about 1 woman in 7 (approximately 14.3%) will 
develop breast cancer. Suppose that of those women who develop breast cancer, a test is 
negative 2% of the time. Also suppose that in the general population of women, the test 
for breast cancer is negative about 85% of the time. Let B = woman develops breast 
cancer and let NV = tests negative. Suppose one woman is selected at random. 

Exercise: 


Problem: 


What is the probability that the woman develops breast cancer? What is the 
probability that woman tests negative? 


Solution: 


P(B) = 0.143; P(N) = 0.85
Exercise: 


Problem: 


Given that the woman has breast cancer, what is the probability that she tests 
negative? 


Solution: 


P(N|B) = 0.02 
Exercise: 


Problem: 
What is the probability that the woman has breast cancer AND tests negative? 
Solution: 


P(B AND N) = P(B) · P(N|B) = (0.143)(0.02) = 0.0029
Exercise: 


Problem: 
What is the probability that the woman has breast cancer or tests negative? 


Solution: 


P(B OR N) = P(B) + P(N) - P(B AND N) = 0.143 + 0.85 - 0.0029 = 0.9901


Exercise: 


Problem: Are having breast cancer and testing negative independent events? 


Solution: 


No. P(N) = 0.85; P(N|B) = 0.02. So, P(N|B) does not equal P(N) 


Exercise: 


Problem: Are having breast cancer and testing negative mutually exclusive? 


Solution: 


No. P(B AND N) = 0.0029. For B and N to be mutually exclusive, 
P(B AND N) must be 0. 


Glossary 


Independent Events 
The occurrence of one event has no effect on the probability of the occurrence of 
any other event. Events A and B are independent if one of the following is true: (1)
P(A|B) = P(A); (2) P(B|A) = P(B); (3) P(A AND B) = P(A)P(B).


Mutually Exclusive 
An observation cannot fall into more than one class (category). Being in more than 
one category prevents being in a mutually exclusive category. 


Sample Space 
The set of all possible outcomes of an experiment. 


Contingency Tables 
This module introduces the contingency table as a way of determining conditional 
probabilities. 


A contingency table provides a way of portraying data that can facilitate calculating
probabilities. The table helps in determining conditional probabilities quite easily. The
table displays sample values in relation to two different variables that may be dependent
or contingent on one another. Later on, we will use contingency tables again, but in
another manner.


Example: 
Suppose a study of speeding violations and drivers who use car phones produced the 
following fictional data: 


                       Speeding violation    No speeding violation
                       in the last year      in the last year        Total
Car phone user         25                    280                     305
Not a car phone user   45                    405                     450
Total                  70                    685                     755


The total number of people in the sample is 755. The row totals are 305 and 450. The 
column totals are 70 and 685. Notice that 305 + 450 = 755 and 70 + 685 = 755. 
Calculate the following probabilities using the table 

Exercise: 


Problem: P(person is a car phone user) = 


Solution: 


number of car phone users / total number in study = 305/755


Exercise: 


Problem: P(person had no violation in the last year) = 


Solution: 


number that had no violation / total number in study = 685/755


Exercise: 


Problem: 
P(person had no violation in the last year AND was a car phone user) = 


Solution: 


280/755


Exercise: 


Problem: 


P(person is a car phone user OR person had no violation in the last year) = 


Solution: 

(305/755 + 685/755) - 280/755 = 710/755
Exercise: 

Problem: 


P(person is a car phone user GIVEN person had a violation in the last year) = 
Solution: 


25/70 (The sample space is reduced to the number of persons who had a violation.)
Exercise: 


Problem: 
P(person had no violation last year GIVEN person was not a car phone user) = 


Solution: 


405/450 (The sample space is reduced to the number of persons who were not car phone
users.)
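The table lookups in this example can be sketched in code: store the four inner cells, then sum the cells that match a row or column. The labels and the helper `count` below are ours, chosen for illustration; the counts are those of the fictional study.

```python
from fractions import Fraction

# Rows: car phone use; columns: speeding violation in the last year.
table = {
    ("user", "violation"): 25,
    ("user", "no violation"): 280,
    ("non-user", "violation"): 45,
    ("non-user", "no violation"): 405,
}
total = sum(table.values())


def count(row=None, col=None):
    """Sum the cells matching the given row and/or column label."""
    return sum(n for (r, c), n in table.items()
               if (row is None or r == row) and (col is None or c == col))


p_user = Fraction(count(row="user"), total)
p_no_violation = Fraction(count(col="no violation"), total)
p_user_and_no_violation = Fraction(count("user", "no violation"), total)
p_user_or_no_violation = p_user + p_no_violation - p_user_and_no_violation

# Conditionals: divide by the row or column total, not the grand total.
p_user_given_violation = Fraction(count("user", "violation"),
                                  count(col="violation"))
p_no_violation_given_non_user = Fraction(count("non-user", "no violation"),
                                         count(row="non-user"))

print(p_user, p_no_violation, p_user_or_no_violation)
print(p_user_given_violation, p_no_violation_given_non_user)
```

The printed fractions are reduced to lowest terms, so they may look different from the unreduced answers (for example, 305/755 prints as 61/151) while being equal in value.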


Example: 
The following table shows a random sample of 100 hikers and the areas of hiking 
preferred: 


Hiking Area Preference

Sex      The Coastline   Near Lakes and Streams   On Mountain Peaks   Total
Female   18              16                       ___                 45
Male     ___             ___                      14                  55
Total    ___             41                       ___                 ___


Exercise: 


Problem: Complete the table. 


Solution: 
Hiking Area Preference

Sex      The Coastline   Near Lakes and Streams   On Mountain Peaks   Total
Female   18              16                       11                  45
Male     16              25                       14                  55
Total    34              41                       25                  100


Exercise: 


Problem: 
Are the events "being female" and "preferring the coastline" independent events? 
Let F = being female and let C = preferring the coastline.


a. P(F AND C) =
b. P(F) · P(C) =


Are these two numbers the same? If they are, then F and C are independent. If they
are not, then F and C are not independent.


Solution: 
a. P(F AND C) = 18/100 = 0.18
b. P(F) · P(C) = (45/100)(34/100) = (0.45)(0.34) = 0.153


P(F AND C) ≠ P(F) · P(C), so the events F and C are not independent.
Exercise: 

Problem: 

Find the probability that a person is male given that the person prefers hiking near
lakes and streams. Let M = being male and let L = prefers hiking near lakes and
streams.


a. What word tells you this is a conditional?
b. Fill in the blanks and calculate the probability: P(__|__) =
c. Is the sample space for this problem all 100 hikers? If not, what is it?


Solution: 
a. The word 'given' tells you that this is a conditional.
b. P(M|L) = 25/41
c. No, the sample space for this problem is the 41 hikers who prefer lakes and streams.


Exercise: 


Problem: 


Find the probability that a person is female or prefers hiking on mountain peaks. 
Let F = being female and let P = prefers mountain peaks. 


a. P(F) =
b. P(P) =
c. P(F AND P) =
d. Therefore, P(F OR P) =


Solution: 
a. P(F) = 45/100
b. P(P) = 25/100
c. P(F AND P) = 11/100
d. P(F OR P) = 45/100 + 25/100 - 11/100 = 59/100

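Completing the hikers table is a matter of subtracting known entries from row and column totals. The sketch below fills the blanks from the values given in the problem (Female 18 and 16 with total 45, Male 14 with total 55, lakes-and-streams total 41, sample size 100) and then recomputes the exercise answers; the dictionary layout and names are ours.

```python
from fractions import Fraction

areas = ["coastline", "lakes and streams", "mountain peaks"]
# Known entries; None marks a blank cell in the original table.
female = {"coastline": 18, "lakes and streams": 16, "mountain peaks": None}
male = {"coastline": None, "lakes and streams": None, "mountain peaks": 14}
female_total, male_total, lakes_total, grand_total = 45, 55, 41, 100

# Fill each blank from a row or column total.
female["mountain peaks"] = (female_total - female["coastline"]
                            - female["lakes and streams"])
male["lakes and streams"] = lakes_total - female["lakes and streams"]
male["coastline"] = (male_total - male["lakes and streams"]
                     - male["mountain peaks"])
column_totals = {a: female[a] + male[a] for a in areas}

p_f = Fraction(female_total, grand_total)
p_c = Fraction(column_totals["coastline"], grand_total)
p_f_and_c = Fraction(female["coastline"], grand_total)
p_m_given_l = Fraction(male["lakes and streams"], lakes_total)
p_p = Fraction(column_totals["mountain peaks"], grand_total)
p_f_or_p = p_f + p_p - Fraction(female["mountain peaks"], grand_total)

print(female, male, column_totals)
print(p_f_and_c == p_f * p_c, p_m_given_l, p_f_or_p)
```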
Example: 


Muddy Mouse lives in a cage with 3 doors. If Muddy goes out the first door, the
probability that he gets caught by Alissa the cat is 1/5 and the probability he is not caught
is 4/5. If he goes out the second door, the probability he gets caught by Alissa is 1/4 and
the probability he is not caught is 3/4. The probability that Alissa catches Muddy coming
out of the third door is 1/2 and the probability she does not catch Muddy is 1/2. It is
equally likely that Muddy will choose any of the three doors so the probability of
choosing each door is 1/3.


Door Choice

Caught or Not   Door One   Door Two   Door Three   Total
Caught          1/15       1/12       1/6          ___
Not Caught      4/15       3/12       1/6          ___
Total           ___        ___        ___          1


• The first entry 1/15 = (1/5)(1/3) is P(Door One AND Caught).
• The entry 4/15 = (4/5)(1/3) is P(Door One AND Not Caught).


Verify the remaining entries. 
Exercise: 


Problem:


Complete the probability contingency table. Calculate the entries for the totals.
Verify that the lower-right corner entry is 1.


Solution:


Door Choice

Caught or Not   Door One   Door Two   Door Three   Total
Caught          1/15       1/12       1/6          19/60
Not Caught      4/15       3/12       1/6          41/60
Total           5/15       4/12       2/6          1


Exercise:


Problem: What is the probability that Alissa does not catch Muddy?


Solution:


41/60


Exercise:


Problem:


What is the probability that Muddy chooses Door One OR Door Two given that
Muddy is caught by Alissa?


Solution:


9/19


Note: You could also do this problem by using a probability tree. See the Tree Diagrams 
(Optional) section of this chapter for examples. 
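The Muddy Mouse table can also be rebuilt with the multiplication rule. The sketch below assumes catch probabilities of 1/5, 1/4, and 1/2 for Doors One, Two, and Three, which are the values consistent with the table totals 19/60 and 41/60; the variable names are ours.

```python
from fractions import Fraction

p_door = Fraction(1, 3)   # each door is equally likely
p_caught_given_door = {   # assumed P(Caught | door), see the note above
    "one": Fraction(1, 5),
    "two": Fraction(1, 4),
    "three": Fraction(1, 2),
}

# Joint probabilities: P(door AND caught) = P(door) * P(caught | door).
caught = {d: p_door * p for d, p in p_caught_given_door.items()}
not_caught = {d: p_door * (1 - p) for d, p in p_caught_given_door.items()}

p_caught = sum(caught.values())           # row total for Caught
p_not_caught = sum(not_caught.values())   # row total for Not Caught

# Conditional: reduce the sample space to the Caught row.
p_one_or_two_given_caught = (caught["one"] + caught["two"]) / p_caught

print(p_caught, p_not_caught, p_caught + p_not_caught)
print(p_one_or_two_given_caught)
```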


Glossary 


Contingency Table 
The method of displaying a frequency distribution as a table with rows and columns 
to show how two variables may be dependent (contingent) upon each other. The 
table provides an easy way to calculate conditional probabilities. 


Venn Diagrams (optional) 

This module introduces Venn diagrams as a method for solving some probability 
problems. This module is included in the Elementary Statistics textbook/collection as an 
optional lesson. 


A Venn diagram is a picture that represents the outcomes of an experiment. It generally 
consists of a box that represents the sample space S together with circles or ovals. The 
circles or ovals represent events. 


Example: 

Suppose an experiment has the outcomes 1, 2, 3, ... , 12 where each outcome has an 
equal chance of occurring. Let event A = {1, 2, 3, 4, 5, 6} and event B = {6, 7, 8, 9}. 
Then A AND B = {6} and A OR B = {1, 2, 3, 4, 5, 6, 7, 8, 9}. The Venn diagram is
as follows: 


Example: 

Flip 2 fair coins. Let A = tails on the first coin. Let B = tails on the second coin. Then 
A = {TT, TH} and B = {TT, HT}. Therefore, A AND B = {TT}. 

A OR B = {TH, HT, TT}.

The sample space when you flip two fair coins is S = {HH, HT, TH, TT}. The 
outcome HH is in neither A nor B. The Venn diagram is as follows: 


[Figure: Venn diagram of events A and B inside the sample space S]


Example: 

Forty percent of the students at a local college belong to a club and 50% work part 
time. Five percent of the students work part time and belong to a club. Draw a Venn 
diagram showing the relationships. Let C = student belongs to a club and PT = student
works part time. 


[Figure: Venn diagram of events C and PT inside the sample space S, overlapping in the region C AND PT]


If a student is selected at random, find:


• The probability that the student belongs to a club. P(C) = 0.40.
• The probability that the student works part time. P(PT) = 0.50.
• The probability that the student belongs to a club AND works part time.
P(C AND PT) = 0.05.
• The probability that the student belongs to a club given that the student works part
time.

P(C|PT) = P(C AND PT) / P(PT) = 0.05 / 0.50 = 0.1

• The probability that the student belongs to a club OR works part time.

P(C OR PT) = P(C) + P(PT) - P(C AND PT) = 0.40 + 0.50 - 0.05 = 0.85


Glossary 


Venn Diagram 


The visual representation of a sample space and events in the form of circles or ovals 
showing their intersections. 


Tree Diagrams (optional) 

This module introduces tree diagrams as a method for making some probability 
problems easier to solve. This module is included in the Elementary Statistics 
textbook/collection as an optional lesson. 


A tree diagram is a special type of graph used to determine the outcomes of an 
experiment. It consists of "branches" that are labeled with either frequencies or 
probabilities. Tree diagrams can make some probability problems easier to visualize 
and solve. The following example illustrates how to use a tree diagram. 


Example: 

In an urn, there are 11 balls. Three balls are red (R) and 8 balls are blue (B). Draw
two balls, one at a time, with replacement. "With replacement" means that you put 
the first ball back in the urn before you select the second ball. The tree diagram using 
frequencies that show all the possible outcomes follows. 


[Figure: tree diagram with frequencies at the ends of the branches: 64 BB, 24 BR, 24 RB, 9 RR]


Total = 64 + 24 + 24 + 9 = 121


The first set of branches represents the first draw. The second set of branches 
represents the second draw. Each of the outcomes is distinct. In fact, we can list each 
red ball as R1, R2, and R3 and each blue ball as B1, B2, B3, B4, B5, B6, B7, and 
B8. Then the 9 RR outcomes can be written as: 

R1R1 R1R2 R1R3 R2R1 R2R2 R2R3 R3R1 R3R2 R3R3 

The other outcomes are similar. 

There are a total of 11 balls in the urn. Draw two balls, one at a time, and with 
replacement. There are 11 · 11 = 121 outcomes, the size of the sample space.
Exercise: 


Problem: List the 24 BR outcomes: B1R1, B1R2, B1R3, ... 
Solution: 
B1R1 B1R2 B1R3 B2R1 B2R2 B2R3 B3R1 B3R2 B3R3 B4R1 B4R2
B4R3 B5R1 B5R2 B5R3 B6R1 B6R2 B6R3 B7R1 B7R2 B7R3 B8R1
B8R2 B8R3


Exercise: 


Problem: Using the tree diagram, calculate P(RR). 
Solution: 


P(RR) = (3/11)(3/11) = 9/121


Exercise: 


Problem: Using the tree diagram, calculate P(RB OR BR). 


Solution: 


P(RB OR BR) = (3/11)(8/11) + (8/11)(3/11) = 48/121
Exercise: 


Problem: 


Using the tree diagram, calculate P(R on 1st draw AND B on 2nd draw). 


Solution: 


P(R on 1st draw AND B on 2nd draw) = P(RB) = (3/11)(8/11) = 24/121


Exercise: 


Problem: 
Using the tree diagram, calculate P(R on 2nd draw given B on 1st draw). 
Solution: 


P(R on 2nd draw given B on 1st draw) = P(R on 2nd | B on 1st) = 24/88 = 3/11


This problem is a conditional. The sample space has been reduced to those
outcomes that already have a blue on the first draw. There are 24 + 64 = 88
possible outcomes (24 BR and 64 BB). Twenty-four of the 88 possible
outcomes are BR. 24/88 = 3/11.


Exercise: 


Problem: Using the tree diagram, calculate P(BB). 


Solution: 

P(BB) = (8/11)(8/11) = 64/121
Exercise: 

Problem: 


Using the tree diagram, calculate 
P(B on the 2nd draw given R on the first draw). 


Solution: 
P(B on 2nd draw | R on 1st draw) = 8/11


There are 9 + 24 outcomes that have R on the first draw (9 RR and 24 RB).
The sample space is then 9 + 24 = 33. Twenty-four of the 33 outcomes have
B on the second draw. The probability is then 24/33 = 8/11.
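The frequencies on the with-replacement tree can be reproduced by listing all 121 ordered pairs of balls. The sketch below labels the balls R1 to R3 and B1 to B8 as in the example; the helper `pattern` and the other names are ours.

```python
from fractions import Fraction
from itertools import product

balls = [f"R{i}" for i in range(1, 4)] + [f"B{i}" for i in range(1, 9)]

# With replacement: the same ball may appear on both draws.
outcomes = list(product(balls, repeat=2))


def pattern(outcome):
    """Reduce an outcome such as ('B1', 'R2') to its colors, 'BR'."""
    return outcome[0][0] + outcome[1][0]


counts = {}
for o in outcomes:
    counts[pattern(o)] = counts.get(pattern(o), 0) + 1

p_rr = Fraction(counts["RR"], len(outcomes))
p_rb_or_br = Fraction(counts["RB"] + counts["BR"], len(outcomes))
# Conditionals: restrict the count to outcomes with the given first draw.
p_r2_given_b1 = Fraction(counts["BR"], counts["BR"] + counts["BB"])
p_b2_given_r1 = Fraction(counts["RB"], counts["RR"] + counts["RB"])

print(len(outcomes), counts)
print(p_rr, p_rb_or_br, p_r2_given_b1, p_b2_given_r1)
```

Because the first ball is returned to the urn, the conditional probabilities equal the unconditional ones (3/11 for red, 8/11 for blue), which is the independence property of sampling with replacement.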


Example: 
An urn has 3 red marbles and 8 blue marbles in it. Draw two marbles, one at a time,
this time without replacement from the urn. "Without replacement" means that you
do not put the first marble back before you select the second marble. Below is a tree
diagram. The branches are labeled with probabilities instead of frequencies. The
numbers at the ends of the branches are calculated by multiplying the numbers on the
two corresponding branches, for example (3/11)(2/10) = 6/110.


[Figure: tree diagram with probabilities on the branches; the products at the ends of the branches are BB 56/110, BR 24/110, RB 24/110, RR 6/110]


Total = (56 + 24 + 24 + 6)/110 = 110/110 = 1


Note:If you draw a red on the first draw from the 3 red possibilities, there are 2 red 
left to draw on the second draw. You do not put back or replace the first ball after 
you have drawn it. You draw without replacement, so that on the second draw there 
are 10 marbles left in the urn. 


Calculate the following probabilities using the tree diagram. 
Exercise: 


Problem: P(RR) = 


Solution: 
P(RR) = (3/11)(2/10) = 6/110
Exercise: 
Problem: Fill in the blanks: 


P(RB OR BR) = (3/11)(8/10) + (___)(___) = 48/110


Solution: 


P(RB OR BR) = (3/11)(8/10) + (8/11)(3/10) = 48/110


Exercise: 


Problem: P(R on 2nd | B on 1st) =

Solution:

P(R on 2nd | B on 1st) = 3/10
Exercise: 

Problem: Fill in the blanks: 


P(R on 1st and B on 2nd) = P(RB) = (___)(___) = 24/110


Solution:
P(R on 1st and B on 2nd) = P(RB) = (3/11)(8/10) = 24/110
Exercise: 
Problem: P(BB) = 
Solution: 


P(BB) = (8/11)(7/10) = 56/110


Exercise: 


Problem: P(B on 2nd | R on 1st) =
Solution: 


There are 6 + 24 outcomes that have R on the first draw (6 RR and 24 RB). 


The 6 and the 24 are frequencies. They are also the numerators of the fractions
6/110 and 24/110. The sample space is no longer 110 but 6 + 24 = 30. Twenty-
four of the 30 outcomes have B on the second draw. The probability is then 24/30.


Did you get this answer? 
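The without-replacement answers can be checked the same way, by listing ordered pairs of two different marbles. This is a minimal sketch; the marble labels R1 to R3 and B1 to B8 and the variable names are ours.

```python
from fractions import Fraction
from itertools import permutations

marbles = [f"R{i}" for i in range(1, 4)] + [f"B{i}" for i in range(1, 9)]

# Without replacement: ordered pairs of two different marbles.
outcomes = list(permutations(marbles, 2))

counts = {"RR": 0, "RB": 0, "BR": 0, "BB": 0}
for first, second in outcomes:
    counts[first[0] + second[0]] += 1   # color pattern, e.g. "RB"

p_rr = Fraction(counts["RR"], len(outcomes))
p_rb_or_br = Fraction(counts["RB"] + counts["BR"], len(outcomes))
p_bb = Fraction(counts["BB"], len(outcomes))
# Conditionals: the sample space shrinks to outcomes with the given first draw.
p_r2_given_b1 = Fraction(counts["BR"], counts["BR"] + counts["BB"])
p_b2_given_r1 = Fraction(counts["RB"], counts["RR"] + counts["RB"])

print(len(outcomes), counts)
print(p_rr, p_rb_or_br, p_bb, p_r2_given_b1, p_b2_given_r1)
```

Here P(R on 2nd | B on 1st) = 3/10 differs from the unconditional chance of red on a single draw, 3/11, so the two draws are dependent.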


If we are using probabilities, we can label the tree in the following general way. 


[Figure: general tree diagram; the ends of the branches are labeled with products such as P(R AND R) = P(RR)]


• P(R|R) here means P(R on 2nd | R on 1st)
• P(B|R) here means P(B on 2nd | R on 1st)
• P(R|B) here means P(R on 2nd | B on 1st)
• P(B|B) here means P(B on 2nd | B on 1st)


Glossary 


Sample Space 
The set of all possible outcomes of an experiment. 


Tree Diagram 
A useful visual representation of a sample space and events in the form of a
“tree” with branches marked by possible outcomes together with their
associated probabilities (frequencies, relative frequencies).


Summary of Formulas 

This module provides a review of the probability formulas, including the 
definitions of independent, complementary, and mutually exclusive events 
as well as the addition and multiplication rules. 

Formula 

Complement 


If A and A’ are complements then P(A) + P(A’) = 1 
Formula 
Addition Rule 


P(A OR B) = P(A) + P(B) — P(A AND B) 
Formula 
Mutually Exclusive 


If A and B are mutually exclusive then P(A AND B) = 0; so 
P(A OR B) = P(A) + P(B).

Formula 

Multiplication Rule 


• P(A AND B) = P(B)P(A|B)
• P(A AND B) = P(A)P(B|A)


Formula 
Independence 


If A and B are independent then: 


• P(A|B) = P(A)
• P(B|A) = P(B)
• P(A AND B) = P(A)P(B)
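
The formulas above can be verified on a small sample space. The sketch below is illustrative only: the fair die and the events A and B are assumptions chosen for this example, not taken from the text. In the code, `A | B` and `A & B` are Python set union and intersection (OR and AND), not conditional probability.

```python
from fractions import Fraction

# Illustrative sample space (an assumption): one roll of a fair die.
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}       # the roll is even
B = {1, 2, 3, 4}    # the roll is at most 4

def P(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

def P_given(event, condition):
    """Conditional probability P(event | condition)."""
    return Fraction(len(event & condition), len(condition))

complement_ok = P(A) + P(S - A) == 1
addition_ok = P(A | B) == P(A) + P(B) - P(A & B)
multiplication_ok = P(A & B) == P(B) * P_given(A, B)
independent = P_given(A, B) == P(A) and P(A & B) == P(A) * P(B)

print(complement_ok, addition_ok, multiplication_ok, independent)
```

For this particular choice of A and B the independence conditions happen to hold; changing B to {1, 2, 3}, for instance, makes them fail while the other three rules still hold.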


Practice 1: Contingency Tables 

This module provides the opportunity for students to apply what they've learned about probability to 
solve a series of problems given a set of data. Students will practice constructing and interpreting 
contingency tables. 


Student Learning Outcomes 


• The student will construct and interpret contingency tables.


Given 

An article in the New England Journal of Medicine reported on a study of smokers in California
and Hawaii. In one part of the report, the self-reported ethnicity and smoking levels per day were
given. Of the people smoking at most 10 cigarettes per day, there were 9886 African Americans, 2745
Native Hawaiians, 12,831 Latinos, 8378 Japanese Americans, and 7650 Whites. Of the people
smoking 11-20 cigarettes per day, there were 6514 African Americans, 3062 Native Hawaiians, 4932
Latinos, 10,680 Japanese Americans, and 9877 Whites. Of the people smoking 21-30 cigarettes per
day, there were 1671 African Americans, 1419 Native Hawaiians, 1406 Latinos, 4715 Japanese
Americans, and 6062 Whites. Of the people smoking at least 31 cigarettes per day, there were 759
African Americans, 788 Native Hawaiians, 800 Latinos, 2305 Japanese Americans, and 3970 Whites.
(Source: http://www.nejm.org/doi/full/10.1056/NEJMoa033250)


Complete the Table 


Complete the table below using the data provided. 


Smoking Level    African American    Native Hawaiian    Latino    Japanese Americans    White    TOTALS
1-10
11-20
21-30
31+
TOTALS


Smoking Levels by Ethnicity 


Analyze the Data 


Suppose that one person from the study is randomly selected. 


Exercise: 


Problem: Find the probability that person smoked 11-20 cigarettes per day. 


Solution: 


35,065/100,450


Exercise: 


Problem: Find the probability that person was Latino. 


Solution: 


19,969/100,450


Discussion Questions 


Exercise: 


Problem: 


In words, explain what it means to pick one person from the study and that person is “Japanese 
American AND smokes 21-30 cigarettes per day.” Also, find the probability. 
Solution: 


4,715/100,450


Exercise: 


Problem: 


In words, explain what it means to pick one person from the study and that person is “Japanese 
American OR smokes 21-30 cigarettes per day.” Also, find the probability. 


Solution: 


36,636/100,450


Exercise: 


Problem: 


In words, explain what it means to pick one person from the study and that person is “Japanese 
American GIVEN that person smokes 21-30 cigarettes per day.” Also, find the probability. 


Solution: 


4,715/15,273


Exercise: 


Problem: Prove that smoking level/day and ethnicity are dependent events. 
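
One way to organize the calculations in this practice is to store the counts as a table and derive every probability from the row, column, and grand totals. The sketch below is a hedged illustration: the counts are transcribed from the "Given" paragraph, while the variable names are chosen here and are not part of the original module.

```python
from fractions import Fraction

# Counts transcribed from the "Given" paragraph (columns in this order).
groups = ["African American", "Native Hawaiian", "Latino",
          "Japanese American", "White"]
table = {
    "1-10":  [9886, 2745, 12831, 8378, 7650],
    "11-20": [6514, 3062, 4932, 10680, 9877],
    "21-30": [1671, 1419, 1406, 4715, 6062],
    "31+":   [759, 788, 800, 2305, 3970],
}

row_totals = {level: sum(counts) for level, counts in table.items()}
col_totals = {g: sum(table[level][i] for level in table)
              for i, g in enumerate(groups)}
grand_total = sum(row_totals.values())

ja = groups.index("Japanese American")
both = table["21-30"][ja]
p_and = Fraction(both, grand_total)
p_or = Fraction(col_totals["Japanese American"] + row_totals["21-30"] - both,
                grand_total)
p_given = Fraction(both, row_totals["21-30"])
p_ja = Fraction(col_totals["Japanese American"], grand_total)

# If the two events were independent, p_given would equal p_ja.
dependent = p_given != p_ja
print(float(p_given), float(p_ja), dependent)
```

Finding a single pair of events for which the conditional and unconditional probabilities differ is enough to show that smoking level and ethnicity are dependent in this data set.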


Practice 2: Calculating Probabilities 

This module allows students to practice using what they've learned about Probability. Students will apply their 
understanding of basic probability terms, calculate probabilities based on the data provided, and determine whether 
events are independent or mutually exclusive. 


Student Learning Outcomes 
• Students will define basic probability terms.
• Students will calculate probabilities.
• Students will determine whether two events are mutually exclusive or whether two events are independent.


Note: Use probability rules to solve the problems below. Show your work. 


Given 

48% of all California registered voters prefer life in prison without parole over the death penalty for a person
convicted of first degree murder. Among Latino California registered voters, 55% prefer life in prison without
parole over the death penalty for a person convicted of first degree murder. (Source:
http://field.com/fieldpollonline/subscribers/RIs2393.pdf).

37.6% of all Californians are Latino (Source: U.S. Census Bureau). 

In this problem, let: 


• C = Californians (registered voters) preferring life in prison without parole over the death penalty for a person convicted of first degree murder
• L = Latino Californians


Suppose that one Californian is randomly selected. 


Analyze the Data 
Exercise: 
Problem: P(C) = 
Solution: 


0.48 


Exercise: 


Problem: P(L) = 


Solution: 


0.376 


Exercise: 
Problem: P(C|L) = 


Solution: 


0.55 


Exercise: 


Problem: In words, what is "C | L"?


Exercise: 


Problem: P(L AND C) = 


Solution: 


0.2068 


Exercise: 


Problem: In words, what is “L AND C”?


Exercise: 


Problem: Are L and C independent events? Show why or why not.


Solution: 
No 


Exercise: 


Problem: P(L OR C) = 


Solution: 
0.6492 


Exercise: 


Problem: In words, what is “L OR C”?


Exercise: 


Problem: Are L and C mutually exclusive events? Show why or why not.


Solution: 


No 
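
The answers in this practice follow from the multiplication and addition rules. A minimal sketch, using only the three probabilities given in the problem (the variable names are chosen here for illustration):

```python
# Values given in the problem statement.
p_c = 0.48           # P(C)
p_l = 0.376          # P(L)
p_c_given_l = 0.55   # P(C|L)

p_l_and_c = p_l * p_c_given_l       # multiplication rule
p_l_or_c = p_l + p_c - p_l_and_c    # addition rule

# Independent only if P(C|L) = P(C); mutually exclusive only if P(L AND C) = 0.
independent = abs(p_c_given_l - p_c) < 1e-9
mutually_exclusive = abs(p_l_and_c) < 1e-9

print(round(p_l_and_c, 4), round(p_l_or_c, 4), independent, mutually_exclusive)
```

The rounding is there only because floating-point products such as 0.376 × 0.55 are not stored exactly.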


Homework 

Probability: Homework is part of the collection col10555 written by Barbara Illowsky and Susan 
Dean and provides a number of homework exercises related to Probability with contributions 
from Roberta Bloom. 

Exercise: 


Problem: 


Suppose that you have 8 cards. 5 are green and 3 are yellow. The 5 green cards are 
numbered 1, 2, 3, 4, and 5. The 3 yellow cards are numbered 1, 2, and 3. The cards are well 
shuffled. You randomly draw one card. 


• G = card drawn is green
• E = card drawn is even-numbered

• a. List the sample space.
• b. P(G) =
• c. P(G|E) =
• d. P(G AND E) =
• e. P(G OR E) =
• f. Are G and E mutually exclusive? Justify your answer numerically.


Solution: 


• a. {G1, G2, G3, G4, G5, Y1, Y2, Y3}
• b. 5/8
• c. 2/3
• d. 2/8
• e. 6/8
• f. No


Exercise: 


Problem: 


Refer to the previous problem. Suppose that this time you randomly draw two cards, one at a 
time, and with replacement. 


• G1 = first card is green
• G2 = second card is green

• a. Draw a tree diagram of the situation.
• b. P(G1 AND G2) =
• c. P(at least one green) =
• d. P(G2 | G1) =
• e. Are G2 and G1 independent events? Explain why or why not.


Exercise: 


Problem: 


Refer to the previous problems. Suppose that this time you randomly draw two cards, one at 
a time, and without replacement. 


• G1 = first card is green
• G2 = second card is green

• a. Draw a tree diagram of the situation.
• b. P(G1 AND G2) =
• c. P(at least one green) =
• d. P(G2 | G1) =
• e. Are G2 and G1 independent events? Explain why or why not.


Solution: 
• b. (5/8)(4/7)
• c. (5/8)(3/7) + (3/8)(5/7) + (5/8)(4/7)
• d. 4/7
• e. No
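
The two card-drawing exercises differ only in whether the first card is returned to the deck. The sketch below contrasts the two cases with exact fractions; it assumes the 5 green and 3 yellow cards from the problem, and the function name `two_draws` is hypothetical.

```python
from fractions import Fraction

GREEN, YELLOW = 5, 3           # card counts from the problem
TOTAL = GREEN + YELLOW

def two_draws(replace):
    """Return (P(G1 AND G2), P(at least one green), P(G2|G1))."""
    left = TOTAL if replace else TOTAL - 1
    p_g1 = Fraction(GREEN, TOTAL)
    p_g2_given_g1 = Fraction(GREEN if replace else GREEN - 1, left)
    p_y2_given_y1 = Fraction(YELLOW if replace else YELLOW - 1, left)
    p_both_green = p_g1 * p_g2_given_g1
    p_both_yellow = Fraction(YELLOW, TOTAL) * p_y2_given_y1
    # "At least one green" is the complement of "both yellow".
    return p_both_green, 1 - p_both_yellow, p_g2_given_g1

with_repl = two_draws(replace=True)
without_repl = two_draws(replace=False)
print(with_repl, without_repl)
```

With replacement, P(G2|G1) stays at 5/8, so the draws are independent; without replacement it drops to 4/7, so they are not. The complement shortcut used for "at least one green" gives the same value as adding the three branches GY, YG, and GG.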
Exercise: 


Problem: Roll two fair dice. Each die has 6 faces. 


• a. List the sample space.
• b. Let A be the event that either a 3 or 4 is rolled first, followed by an even number. Find
P(A).
• c. Let B be the event that the sum of the two rolls is at most 7. Find P(B).
• d. In words, explain what “P(A|B)” represents. Find P(A|B).
• e. Are A and B mutually exclusive events? Explain your answer in 1 - 3 complete
sentences, including numerical justification.
• f. Are A and B independent events? Explain your answer in 1 - 3 complete sentences,
including numerical justification.


Exercise: 


Problem: 


A special deck of cards has 10 cards. Four are green, three are blue, and three are red. When 
a card is picked, the color of it is recorded. An experiment consists of first picking a card 
and then tossing a coin. 


• a. List the sample space.
• b. Let A be the event that a blue card is picked first, followed by landing a head on the
coin toss. Find P(A).
• c. Let B be the event that a red or green is picked, followed by landing a head on the
coin toss. Are the events A and B mutually exclusive? Explain your answer in 1 - 3
complete sentences, including numerical justification.
• d. Let C be the event that a red or blue is picked, followed by landing a head on the coin
toss. Are the events A and C mutually exclusive? Explain your answer in 1 - 3
complete sentences, including numerical justification.


Solution: 


• a. {GH, GT, BH, BT, RH, RT}


Exercise: 


Problem: An experiment consists of first rolling a die and then tossing a coin: 


• a. List the sample space.
• b. Let A be the event that either a 3 or 4 is rolled first, followed by landing a head on the
coin toss. Find P(A).
• c. Let B be the event that a number less than 2 is rolled, followed by landing a head on
the coin toss. Are the events A and B mutually exclusive? Explain your answer in 1 - 3
complete sentences, including numerical justification.


Exercise: 


Problem: 


An experiment consists of tossing a nickel, a dime and a quarter. Of interest is the side the 
coin lands on. 


• a. List the sample space.
• b. Let A be the event that there are at least two tails. Find P(A).
• c. Let B be the event that the first and second tosses land on heads. Are the events A and
B mutually exclusive? Explain your answer in 1 - 3 complete sentences, including
justification.


Solution: 


• a. {(HHH), (HHT), (HTH), (HTT), (THH), (THT), (TTH), (TTT)}
• b. 4/8
• c. Yes


Exercise: 


Problem: Consider the following scenario: 


• Let P(C) = 0.4
• Let P(D) = 0.5
• Let P(C|D) = 0.6

• a. Find P(C AND D).
• b. Are C and D mutually exclusive? Why or why not?
• c. Are C and D independent events? Why or why not?
• d. Find P(C OR D).
• e. Find P(D|C).


Exercise: 


Problem: E and F are mutually exclusive events. P(E) = 0.4; P(F) = 0.5. Find P(E | F).


Solution: 


0 


Exercise: 


Problem: J and K are independent events. P(J | K) = 0.3. Find P(J).


Exercise: 


Problem: U and V are mutually exclusive events. P(U) = 0.26; P(V) = 0.37. Find: 


• a. P(U AND V) =
• b. P(U | V) =
• c. P(U OR V) =


Solution: 


• a. 0
• b. 0
• c. 0.63
Exercise: 
Problem: 
Q and R are independent events. P(Q) = 0.4; P(Q AND R) = 0.1. Find P(R). 


Exercise: 


Problem: Y and Z are independent events. 


• a. Rewrite the basic Addition Rule P(Y OR Z) = P(Y) + P(Z) - P(Y AND Z)
using the information that Y and Z are independent events.
• b. Use the rewritten rule to find P(Z) if P(Y OR Z) = 0.71 and P(Y) = 0.42.


Solution: 


• b. 0.5
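
For independent events P(Y AND Z) = P(Y)P(Z), so the Addition Rule becomes P(Y OR Z) = P(Y) + P(Z) - P(Y)P(Z), which can be solved for P(Z). The sketch below carries out that algebra; the function name is hypothetical and the approach is one possible way to reach the stated answer.

```python
from fractions import Fraction

def p_z_from_union(p_y_or_z, p_y):
    """Solve P(Y OR Z) = P(Y) + P(Z) - P(Y)P(Z) for P(Z).

    Assumes Y and Z are independent and P(Y) < 1.
    """
    return (p_y_or_z - p_y) / (1 - p_y)

p_y = Fraction(42, 100)
p_z = p_z_from_union(Fraction(71, 100), p_y)

# Substitute back into the rewritten rule as a check.
check = p_y + p_z - p_y * p_z
print(p_z, check)
```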
Exercise: 


Problem: G and H are mutually exclusive events. P(G) = 0.5; P(H) = 0.3 


• a. Explain why the following statement MUST be false: P(H | G) = 0.4.
• b. Find: P(H OR G).
• c. Are G and H independent or dependent events? Explain in a complete sentence.


Exercise: 
Problem: 
The following are real data from Santa Clara County, CA. As of a certain time, there had
been a total of 3059 documented cases of AIDS in the county. They were grouped into the
following categories (Source: Santa Clara County Public H.D.):

          Homosexual/Bisexual    IV Drug User*    Heterosexual Contact    Other    Totals
Female    0                      70               136                     49       ____
Male      2146                   463              60                      135      ____
Totals    ____                   ____             ____                    ____     ____

* includes homosexual/bisexual IV drug users


Suppose one of the persons with AIDS in Santa Clara County is randomly selected. 
Compute the following: 


• a. P(person is female) =
• b. P(person has a risk factor Heterosexual Contact) =
• c. P(person is female OR has a risk factor of IV Drug User) =
• d. P(person is female AND has a risk factor of Homosexual/Bisexual) =
• e. P(person is male AND has a risk factor of IV Drug User) =
• f. P(female GIVEN person got the disease from heterosexual contact) =
• g. Construct a Venn Diagram. Make one group females and the other group heterosexual
contact.


Solution: 


The completed contingency table is as follows:

          Homosexual/Bisexual    IV Drug User*    Heterosexual Contact    Other    Totals
Female    0                      70               136                     49       255
Male      2146                   463              60                      135      2804
Totals    2146                   533              196                     184      3059

* includes homosexual/bisexual IV drug users

• a. 255/3059
• b. 196/3059
• c. 718/3059
• d. 0
• e. 463/3059
• f. 136/196

Exercise:

Problem:


Solve these questions using probability rules. Do NOT use the contingency table above.
3059 cases of AIDS had been reported in Santa Clara County, CA, through a certain date.
Those cases will be our population. Of those cases, 6.4% obtained the disease through
heterosexual contact and 7.4% are female. Out of the females with the disease, 53.3% got
the disease from heterosexual contact.

• a. P(person is female) =
• b. P(person obtained the disease through heterosexual contact) =
• c. P(female GIVEN person got the disease from heterosexual contact) =
• d. Construct a Venn Diagram. Make one group females and the other group heterosexual
contact. Fill in all values as probabilities.


Exercise: 


Problem: 


The following table identifies a group of children by one of four hair colors, and by type of 
hair. 


Hair Type    Brown    Blond    Black    Red     Totals
Wavy         20       ____     15       3       43
Straight     80       15       ____     12      ____
Totals       ____     20       ____     ____    215


• a. Complete the table above.
• b. What is the probability that a randomly selected child will have wavy hair?
• c. What is the probability that a randomly selected child will have either brown or blond
hair?
• d. What is the probability that a randomly selected child will have wavy brown hair?
• e. What is the probability that a randomly selected child will have red hair, given that he
has straight hair?
• f. If B is the event of a child having brown hair, find the probability of the complement
of B.
• g. In words, what does the complement of B represent?


Solution: 


Exercise: 


Problem: 


A previous year, the weights of the members of the San Francisco 49ers and the Dallas 
Cowboys were published in the San Jose Mercury News. The factual data are compiled into 
the following table. 


Shirt#    ≤ 210    211-250    251-290    290 ≤
1-33      21       5          0          0
34-66     6        18         7          4
66-99     6        12         22         5


For the following, suppose that you randomly select one player from the 49ers or Cowboys. 


• a. Find the probability that his shirt number is from 1 to 33.
• b. Find the probability that he weighs at most 210 pounds.
• c. Find the probability that his shirt number is from 1 to 33 AND he weighs at most 210
pounds.
• d. Find the probability that his shirt number is from 1 to 33 OR he weighs at most 210
pounds.
• e. Find the probability that his shirt number is from 1 to 33 GIVEN that he weighs at
most 210 pounds.
• f. If having a shirt number from 1 to 33 and weighing at most 210 pounds were
independent events, then what should be true about P(Shirt# 1-33 | ≤ 210 pounds)?


Exercise: 


Problem: 


Approximately 281,000,000 people over age 5 live in the United States. Of these people, 
55,000,000 speak a language other than English at home. Of those who speak another 
language at home, 62.3% speak Spanish. (Source: 
http://www.census.gov/hhes/socdemo/language/data/acs/ACS-12.pdf) 


Let: E = speak English at home; E' = speak another language at home; S = speak Spanish


Finish each probability statement by matching the correct answer. 


Probability Statements    Answers
a. P(E') =                i. 0.8043
b. P(E) =                 ii. 0.623
c. P(S and E') =          iii. 0.1957
d. P(S|E') =              iv. 0.1219

Solution:

• a. iii
• b. i
• c. iv
• d. ii

Exercise:

Problem:


The probability that a male develops some form of cancer in his lifetime is 0.4567 (Source: 
American Cancer Society). The probability that a male has at least one false positive test 
result (meaning the test comes back for cancer when the man does not have it) is 0.51 
(Source: USA Today). Some of the questions below do not have enough information for you 
to answer them. Write “not enough information” for those answers. 


Let: C = a man develops cancer in his lifetime; P = man has at least one false positive

• a. Construct a tree diagram of the situation.
• b. P(C) =
• c. P(P|C) =
• d. P(P|C') =
• e. If a test comes up positive, based upon numerical values, can you assume that man has
cancer? Justify numerically and explain why or why not.


Exercise: 


Problem: 


In 1994, the U.S. government held a lottery to issue 55,000 Green Cards (permits for
non-citizens to work legally in the U.S.). Renate Deutsch, from Germany, was one of
approximately 6.5 million people who entered this lottery. Let G = won Green Card.

• a. What was Renate’s chance of winning a Green Card? Write your answer as a
probability statement.
• b. In the summer of 1994, Renate received a letter stating she was one of 110,000
finalists chosen. Once the finalists were chosen, assuming that each finalist had an
equal chance to win, what was Renate’s chance of winning a Green Card? Let
F = was a finalist. Write your answer as a conditional probability statement.
• c. Are G and F independent or dependent events? Justify your answer numerically and
also explain why.
• d. Are G and F mutually exclusive events? Justify your answer numerically and also
explain why.

Note: P.S. Amazingly, on 2/1/95, Renate learned that she would receive her Green Card --
true story!


Solution: 


• a. P(G) = 0.008
• b. 0.5
• c. dependent
• d. No


Exercise: 


Problem: 


Three professors at George Washington University did an experiment to determine if 
economists are more selfish than other people. They dropped 64 stamped, addressed 
envelopes with $10 cash in different classrooms on the George Washington campus. 44% 
were returned overall. From the economics classes 56% of the envelopes were returned. 
From the business, psychology, and history classes 31% were returned. (Source: Wall Street 
Journal) 


Let: R = money returned; E = economics classes; O = other classes

• a. Write a probability statement for the overall percent of money returned.
• b. Write a probability statement for the percent of money returned out of the economics
classes.
• c. Write a probability statement for the percent of money returned out of the other
classes.
• d. Is money being returned independent of the class? Justify your answer numerically
and explain it.
• e. Based upon this study, do you think that economists are more selfish than other
people? Explain why or why not. Include numbers to justify your answer.


Exercise: 


Problem: 


The chart below gives the number of suicides estimated in the U.S. for a recent year by age, 
race (black and white), and sex. We are interested in possible relationships between age, 
race, and sex. We will let suicide victims be our population. (Source: The National Center 
for Health Statistics, U.S. Dept. of Health and Human Services) 


Race and Sex     1-14    15-24    25-64     over 64    TOTALS
white, male      210     3360     13,610    ____       22,050
white, female    80      580      3380      ____       4930
black, male      10      460      1060      ____       1670
black, female    0       40       270       ____       330
all others       ____    ____     ____      ____       ____
TOTALS           310     4650     18,780    ____       29,760


Note: Do not include "all others" for parts (f), (g), and (i).

• a. Fill in the column for the suicides for individuals over age 64.
• b. Fill in the row for all other races.
• c. Find the probability that a randomly selected individual was a white male.
• d. Find the probability that a randomly selected individual was a black female.
• e. Find the probability that a randomly selected individual was black.
• f. Find the probability that a randomly selected individual was male.
• g. Out of the individuals over age 64, find the probability that a randomly selected
individual was a black or white male.
• h. Comparing “Race and Sex” to “Age,” which two groups are mutually exclusive? How
do you know?
• i. Are being male and committing suicide over age 64 independent events? How do you
know?


Solution: 


• c. 22050/29760
• d. 330/29760
• e. 2000/29760
• f. 23720/29760
• g. 5010/6020
• h. Black females and ages 1-14
• i. No


The next two questions refer to the following: The percent of licensed U.S. drivers (from a
recent year) that are female is 48.60%. Of the females, 5.03% are age 19 and under; 81.36% are age
20 - 64; 13.61% are age 65 or over. Of the licensed U.S. male drivers, 5.04% are age 19 and 
under; 81.43% are age 20 - 64; 13.53% are age 65 or over. (Source: Federal Highway 
Administration, U.S. Dept. of Transportation) 

Exercise: 


Problem: Complete the following: 


• a. Construct a table or a tree diagram of the situation.
• b. P(driver is female) =
• c. P(driver is age 65 or over | driver is female) =
• d. P(driver is age 65 or over AND female) =
• e. In words, explain the difference between the probabilities in part (c) and part (d).
• f. P(driver is age 65 or over) =
• g. Are being age 65 or over and being female mutually exclusive events? How do you
know?


Exercise: 


Problem: Suppose that 10,000 U.S. licensed drivers are randomly selected. 


• a. How many would you expect to be male?
• b. Using the table or tree diagram from the previous exercise, construct a contingency
table of gender versus age group.
• c. Using the contingency table, find the probability that out of the age 20 - 64 group, a
randomly selected driver is female.


Solution: 


• a. 5140
• c. 0.49
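
The contingency table of expected counts can be built by multiplying along each branch of the tree and scaling by the sample size. The sketch below assumes the percentages quoted before these two exercises; the dictionary layout and variable names are illustrative choices, not part of the original problem.

```python
# Percentages from the problem statement, written as proportions.
p_female = 0.4860
age_given_sex = {
    "female": {"19 and under": 0.0503, "20-64": 0.8136, "65 or over": 0.1361},
    "male":   {"19 and under": 0.0504, "20-64": 0.8143, "65 or over": 0.1353},
}
p_sex = {"female": p_female, "male": 1 - p_female}

n = 10_000  # number of drivers sampled

# Expected count in each cell = n * P(sex) * P(age group | sex).
expected = {
    sex: {age: n * p_sex[sex] * p for age, p in ages.items()}
    for sex, ages in age_given_sex.items()
}

expected_male = sum(expected["male"].values())
in_20_64 = expected["female"]["20-64"] + expected["male"]["20-64"]
p_female_given_20_64 = expected["female"]["20-64"] / in_20_64

print(round(expected_male), round(p_female_given_20_64, 2))
```

The conditional probability in part (c) uses only the 20 - 64 column as its sample space, which is why the denominator is the column total rather than 10,000.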


Exercise: 


Problem: 


Approximately 86.5% of Americans commute to work by car, truck or van. Out of that 
group, 84.6% drive alone and 15.4% drive in a carpool. Approximately 3.9% walk to work 
and approximately 5.3% take public transportation. (Source: Bureau of the Census, U.S. 
Dept. of Commerce. Disregard rounding approximations.) 


• a. Construct a table or a tree diagram of the situation. Include a branch for all other
modes of transportation to work.
• b. Assuming that the walkers walk alone, what percent of all commuters travel alone to
work?
• c. Suppose that 1000 workers are randomly selected. How many would you expect to
travel alone to work?
• d. Suppose that 1000 workers are randomly selected. How many would you expect to
drive in a carpool?


Exercise: 


Problem: Explain what is wrong with the following statements. Use complete sentences. 


• a. If there’s a 60% chance of rain on Saturday and a 70% chance of rain on Sunday, then
there’s a 130% chance of rain over the weekend.
• b. The probability that a baseball player hits a home run is greater than the probability
that he gets a successful hit.


Try these multiple choice questions. 


The next two questions refer to the following probability tree diagram which shows tossing
an unfair coin FOLLOWED BY drawing one bead from a cup containing 3 red (R), 4 yellow (Y)
and 5 blue (B) beads. For the coin, P(H) = 2/3 and P(T) = 1/3 where H = “heads” and
T = “tails”.

[Tree diagram: coin toss, then bead draw]
H (2/3): R 3/12, Y 4/12, B 5/12
T (1/3): R 3/12, Y 4/12, B 5/12


Exercise: 


Problem: Find P(tossing a Head on the coin AND a Red bead) 


• A
• B
• C
• D


Solution: 


C 


Exercise: 


Problem: Find P(Blue bead). 


• A
• B
• C
• D


Solution: 
A 
The next three questions refer to the following table of data obtained from www.baseball-almanac.com
showing hit information for 4 well known baseball players. Suppose that one hit
from the table is randomly selected.

NAME               Single    Double    Triple    Home Run    TOTAL HITS
Babe Ruth          1517      506       136       714         2873
Jackie Robinson    1054      273       54        137         1518
Ty Cobb            3603      174       295       114         4189
Hank Aaron         2294      624       98        759         3771
TOTAL              8471      1577      583       1720        12351

Exercise:

Problem: Find P(hit was made by Babe Ruth).

• A
• B
• C
• D

Solution:

B

Exercise:



Problem: Find P(hit was made by Ty Cobb | The hit was a Home Run) 


Solution: 


B 
Exercise: 


Problem: 


Are the hit being made by Hank Aaron and the hit being a double independent events?

• A. Yes, because P(hit by Hank Aaron | hit is a double) = P(hit by Hank Aaron)
• B. No, because P(hit by Hank Aaron | hit is a double) ≠ P(hit is a double)
• C. No, because P(hit is by Hank Aaron | hit is a double) ≠ P(hit by Hank Aaron)
• D. Yes, because P(hit is by Hank Aaron | hit is a double) = P(hit is a double)

Solution:

C


Exercise: 


Problem: Given events G and H: P(G) = 0.43 ; P(H) = 0.26 ; P(H and G) = 0.14 


• A. Find P(H or G)
• B. Find the probability of the complement of event (H and G)
• C. Find the probability of the complement of event (H or G)

Solution:

• A. P(H or G) = P(H) + P(G) - P(H and G) = 0.26 + 0.43 - 0.14 = 0.55
• B. P(NOT (H and G)) = 1 - P(H and G) = 1 - 0.14 = 0.86
• C. P(NOT (H or G)) = 1 - P(H or G) = 1 - 0.55 = 0.45


Exercise: 


Problem: Given events J and K: P(J) = 0.18; P(K) = 0.37; P(J or K) = 0.45

• A. Find P(J and K)
• B. Find the probability of the complement of event (J and K)
• C. Find the probability of the complement of event (J or K)

Solution:

• A. P(J or K) = P(J) + P(K) - P(J and K); 0.45 = 0.18 + 0.37 - P(J and K); solve to find
P(J and K) = 0.10
• B. P(NOT (J and K)) = 1 - P(J and K) = 1 - 0.10 = 0.90
• C. P(NOT (J or K)) = 1 - P(J or K) = 1 - 0.45 = 0.55


Exercise: 


Problem: 


United Blood Services is a blood bank that serves more than 500 hospitals in 18 states. 
According to their website, http://www.unitedbloodservices.org/humanbloodtypes.html, a 
person with type O blood and a negative Rh factor (Rh-) can donate blood to any person 
with any bloodtype. Their data show that 43% of people have type O blood and 15% of 
people have Rh- factor; 52% of people have type O or Rh- factor. 


• A. Find the probability that a person has both type O blood and the Rh- factor.
• B. Find the probability that a person does NOT have both type O blood and the Rh-
factor.

Solution:

• A. P(Type O or Rh-) = P(Type O) + P(Rh-) - P(Type O and Rh-)
0.52 = 0.43 + 0.15 - P(Type O and Rh-); solve to find P(Type O and Rh-) = 0.06
6% of people have type O Rh- blood
• B. P(NOT (Type O and Rh-)) = 1 - P(Type O and Rh-) = 1 - 0.06 = 0.94
94% of people do not have type O Rh- blood


Exercise: 
Problem: 
At a college, 72% of courses have final exams and 46% of courses require research papers.
Suppose that 32% of courses have a research paper and a final exam. Let F be the event that
a course has a final exam. Let R be the event that a course requires a research paper.

• A. Find the probability that a course has a final exam or a research project.
• B. Find the probability that a course has NEITHER of these two requirements.

Solution:

• A. P(R or F) = P(R) + P(F) - P(R and F) = 0.46 + 0.72 - 0.32 = 0.86
• B. P(Neither R nor F) = 1 - P(R or F) = 1 - 0.86 = 0.14


Exercise: 
Problem: 


In a box of assorted cookies, 36% contain chocolate and 12% contain nuts. Of those, 8% 
contain both chocolate and nuts. Sean is allergic to both chocolate and nuts. 


• A. Find the probability that a cookie contains chocolate or nuts (he can't eat it).
• B. Find the probability that a cookie does not contain chocolate or nuts (he can eat it).

Solution:

• Let C be the event that the cookie contains chocolate. Let N be the event that the cookie
contains nuts.
• A. P(C or N) = P(C) + P(N) - P(C and N) = 0.36 + 0.12 - 0.08 = 0.40
• B. P(neither chocolate nor nuts) = 1 - P(C or N) = 1 - 0.40 = 0.60


Exercise: 


Problem: 


A college finds that 10% of students have taken a distance learning class and that 40% of 
students are part time students. Of the part time students, 20% have taken a distance learning 
class. Let D = event that a student takes a distance learning class and E = event that a student 
is a part time student 


• A. Find P(D and E)
• B. Find P(E | D)
• C. Find P(D or E)
• D. Using an appropriate test, show whether D and E are independent.
• E. Using an appropriate test, show whether D and E are mutually exclusive.

Solution:

• A. P(D and E) = P(D|E)P(E) = (0.20)(0.40) = 0.08
• B. P(E|D) = P(D and E) / P(D) = 0.08/0.10 = 0.80
• C. P(D or E) = P(D) + P(E) - P(D and E) = 0.10 + 0.40 - 0.08 = 0.42
• D. Not Independent: P(D|E) = 0.20 which does not equal P(D) = 0.10
• E. Not Mutually Exclusive: P(D and E) = 0.08; if they were mutually exclusive then we
would need to have P(D and E) = 0, which is not true here.


Exercise: 


Problem: 


When the Euro coin was introduced in 2002, two math professors had their statistics 
students test whether the Belgian 1 Euro coin was a fair coin. They spun the coin rather than 
tossing it, and it was found that out of 250 spins, 140 showed a head (event H) while 110 
showed a tail (event T). Therefore, they claim that this is not a fair coin. 


• A. Based on the data above, find P(H) and P(T).
• B. Use a tree to find the probabilities of each possible outcome for the experiment of
tossing the coin twice.
• C. Use the tree to find the probability of obtaining exactly one head in two tosses of the
coin.
• D. Use the tree to find the probability of obtaining at least one head.

Solution:

• A. P(H) = 140/250; P(T) = 110/250
• C. 308/625
• D. 504/625
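
The two-toss tree in part B can be generated by multiplying the branch probabilities along each path. This is a minimal sketch, assuming the two tosses are independent and using the empirical probabilities from part A; the dictionary `tree` is an illustrative name.

```python
from fractions import Fraction
from itertools import product

# Empirical probabilities from the 250 spins reported in the problem.
p = {"H": Fraction(140, 250), "T": Fraction(110, 250)}

# One entry per path through the two-toss tree (HH, HT, TH, TT).
tree = {a + b: p[a] * p[b] for a, b in product("HT", repeat=2)}

p_exactly_one_head = tree["HT"] + tree["TH"]
p_at_least_one_head = 1 - tree["TT"]   # complement of "no heads"

print(tree, p_exactly_one_head, p_at_least_one_head)
```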


Exercise: 
Problem: 
A box of cookies contains 3 chocolate and 7 butter cookies. Miguel randomly selects a
cookie and eats it. Then he randomly selects another cookie and eats it also. (How many
cookies did he take?)

• A. Draw the tree that represents the possibilities for the cookie selections. Write the
probabilities along each branch of the tree.
• B. Are the probabilities for the flavor of the SECOND cookie that Miguel selects
independent of his first selection? Explain.
• C. For each complete path through the tree, write the event it represents and find the
probabilities.
• D. Let S be the event that both cookies selected were the same flavor. Find P(S).
• E. Let T be the event that both cookies selected were different flavors. Find P(T) by two
different methods: by using the complement rule and by using the branches of the tree.
Your answers should be the same with both methods.
• F. Let U be the event that the second cookie selected is a butter cookie. Find P(U).


**Exercises 33 - 40 contributed by Roberta Bloom 


Review 
This module provides a number of homework/review exercises related to 
Probability. 


The first six exercises refer to the following study: In a survey of 100 
stocks on NASDAQ, the average percent increase for the past year was 9% 
for NASDAQ stocks. Answer the following: 

Exercise: 


Problem: The “average increase” for all NASDAQ stocks is the: 
• A. Population
• B. Statistic
• C. Parameter
• D. Sample
• E. Variable

Solution:

• C. Parameter
Exercise: 


Problem: All of the NASDAQ stocks are the: 


• A. Population
• B. Statistic
• C. Parameter
• D. Sample
• E. Variable

Solution:

• A. Population


Exercise: 


Problem: 9% is the: 
• A. Population
• B. Statistic
• C. Parameter
• D. Sample
• E. Variable

Solution:

• B. Statistic
Exercise: 


Problem: The 100 NASDAQ stocks in the survey are the: 
• A. Population
• B. Statistic
• C. Parameter
• D. Sample
• E. Variable

Solution:

• D. Sample
Exercise: 


Problem: The percent increase for one stock in the survey is the: 


• A. Population
• B. Statistic
• C. Parameter
• D. Sample
• E. Variable

Solution:

• E. Variable
Exercise: 


Problem: 


Would the data collected be qualitative, quantitative — discrete, or 
quantitative — continuous? 


Solution: 


quantitative - continuous 


The next two questions refer to the following study: Thirty people spent 
two weeks around Mardi Gras in New Orleans. Their two-week weight gain 
is below. (Note: a loss is shown by a negative weight gain.) 


Weight Gain    Frequency
-2             3
-1             5
0              2
1              4
4              13
6              2
11             1
Exercise: 


Problem: Calculate the following values: 
• a. The average weight gain for the two weeks
• b. The standard deviation
• c. The first, second, and third quartiles

Solution:

• a. 2.27
• b. 3.04
• c. -1, 4, 4
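
The summary statistics above can be reproduced from the frequency table. The sketch below is an illustration under two assumptions: the frequency for a weight gain of -1 is taken to be 5 (the value that makes the frequencies add to the thirty people in the study), and the quartiles are computed as medians of the lower and upper halves of the sorted data, which is one common convention among several.

```python
import statistics

# Frequency table: weight gain -> number of people.
# The frequency 5 for -1 is inferred so that the total is 30 (assumption).
freq = {-2: 3, -1: 5, 0: 2, 1: 4, 4: 13, 6: 2, 11: 1}

# Expand to the 30 individual observations, in sorted order.
data = sorted(v for v, f in freq.items() for _ in range(f))

mean = statistics.mean(data)
sd = statistics.stdev(data)        # sample standard deviation (n - 1)

# Quartiles as medians of the lower half, all data, and the upper half.
half = len(data) // 2
q1 = statistics.median(data[:half])
q2 = statistics.median(data)
q3 = statistics.median(data[-half:])

print(round(mean, 2), round(sd, 2), q1, q2, q3)
```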


Exercise: 


Problem: Construct a histogram and a boxplot of the data. 


Probability Topics: Probability Lab (edited: Teegarden) 

This module presents students with a lab exercise allowing them to apply their
understanding of probability. Using a Minitab simulation, students will compare theoretical
probabilities with experimental probabilities as they relate to dice. They will also investigate
the effect of sample size.


Probability Lab 


I. Student Learning Outcomes: 
• The student will calculate theoretical and empirical probabilities.
• The student will appraise the differences between the two types of probabilities.
• The student will demonstrate an understanding of long-term probabilities.


II. Theoretical probability for the sum of two dice
Begin by looking at theoretical probabilities for the sum of two dice. Let the value in the
first row be the result for Die 1 and the value in the first column be the value for Die 2.
Input the sum of the corresponding row and column in each box.


Using the following table, record the theoretical probabilities. 


Sum 2 3 4 5 6 7 8 9 10 11 12 
Count 


Probability 


Using the table above, determine the following theoretical probabilities 
1. P(sum less than 5) = 

2. P(sum at least 9) = 

3. P(sum at most 6) = 

4. P(sum more than 7) = 

5. P(sum between 3 and 8) = 


6. P(sum less than 11) = 


III. Experimental (empirical) probability for the sum of two dice 


Rolling the dice: Using Minitab, simulate rolling two dice 360 times and finding the sum.
Use Calc > Random Data > Integer with 360 rows (minimum value 1, maximum value 6) and
store the results in two columns, die 1 and die 2.

Then use Calc > Row Statistics. Select Sum, enter the two die columns, and store the result
in Sum. Use Stat > Tables > Tally to summarize the data.
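
For readers who want to see the same simulation outside Minitab, the following is a minimal Python sketch. It is an illustration only and does not replace the Minitab steps the lab asks for. The function name simulate_dice_sums and the seed value are assumptions chosen for this example.

```python
import random
from collections import Counter

def simulate_dice_sums(n_rolls, seed=None):
    """Roll two fair dice n_rolls times and tally the sums (hypothetical helper)."""
    rng = random.Random(seed)
    sums = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls)]
    return Counter(sums)

# 360 simulated rolls; the seed is arbitrary and only makes the run repeatable
counts = simulate_dice_sums(360, seed=1)
for total in range(2, 13):
    # sum, count, and empirical probability (count / number of rolls)
    print(total, counts[total], round(counts[total] / 360, 4))
```

Calling simulate_dice_sums(720) reruns the experiment with the larger number of rolls used in the essay questions.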


Record the experimental probabilities: 


Sum 2 3 4 5 6 7 8 9 10 11 12 
Count 


Probability 


Using the table above, determine the following experimental probabilities 


1. P(sum less than 5) = 

2. P(sum at least 9) = 

3. P(sum at most 6) = 

4. P(sum more than 7) = 

5. P(sum between 3 and 8) = 

6. P(sum less than 11) = 

IV. Essay Questions (On a separate sheet of paper, answer these questions in 
complete sentences.) 


1. How do the empirical probabilities compare to the theoretical probabilities? (You may 
wish to convert the probabilities to percentages for ease of comparison.) 


2. If you increased the number of times you rolled the dice to 720, would the empirical 
probability values change? Why? 


Rerun the simulation and record your results. 


Sum 2 3 4 5 6 7 8 9 10 11 12 
Count 


Probability 


3. Did the increase in the number of trials cause the empirical probabilities and theoretical 
probabilities to be closer together or farther apart? Why? (You may wish to convert the 
probabilities to percentages for ease of comparison.) 


ATTACH THE SESSION WINDOW WITH YOUR RESULTS AND THE ESSAY 
ANSWERS TO THIS COVER SHEET. 


Discrete Random Variables 
This module serves as the introduction to Discrete Random Variables in the 
Elementary Statistics textbook/collection. 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Recognize and understand discrete probability distribution functions,
in general.
• Calculate and interpret expected values.
• Recognize the binomial probability distribution and apply it
appropriately.
• Recognize the Poisson probability distribution and apply it
appropriately (optional).
• Recognize the geometric probability distribution and apply it
appropriately (optional).
• Recognize the hypergeometric probability distribution and apply it
appropriately (optional).
• Classify discrete word problems by their distributions.


Introduction 


A student takes a 10 question true-false quiz. Because the student had such 
a busy schedule, he or she could not study and randomly guesses at each 
answer. What is the probability of the student passing the test with at least a 
70%? 


Small companies might be interested in the number of long distance phone 
calls their employees make during the peak time of the day. Suppose the 
average is 20 calls. What is the probability that the employees make more 
than 20 long distance phone calls during the peak time? 


These two examples illustrate two different types of probability problems
involving discrete random variables. Recall that discrete data are data that
you can count. A random variable describes the outcomes of a statistical
experiment in words. The values of a random variable can vary with each
repetition of an experiment.


In this chapter, you will study probability problems involving discrete
probability distributions. You will also study long-term averages associated
with them.


Random Variable Notation 


Upper case letters like X or Y denote a random variable. Lower case letters
like x or y denote the value of a random variable. If X is a random
variable, then X is written in words, and x is given as a number.


For example, let X = the number of heads you get when you toss three fair
coins. The sample space for the toss of three fair coins is TTT THH HTH
HHT HTT THT TTH HHH. Then, x = 0, 1, 2, 3. X is in words and x is
a number. Notice that for this example, the x values are countable
outcomes. Because you can count the possible values that X can take on
and the outcomes are random (the x values 0, 1, 2, 3), X is a discrete
random variable.
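
The three-coin example can be enumerated directly. The sketch below is an illustration in Python and is not part of the original module. It lists the eight outcomes and the value of X attached to each one.

```python
from itertools import product
from collections import Counter

# All 2^3 equally likely outcomes of tossing three fair coins
sample_space = ["".join(t) for t in product("HT", repeat=3)]

# X = the number of heads in each outcome
heads = [outcome.count("H") for outcome in sample_space]

values_of_x = sorted(set(heads))   # the values x can take
counts = Counter(heads)            # how many outcomes give each x

print(sample_space)
print(values_of_x, dict(counts))
```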


Optional Collaborative Classroom Activity 


Toss a coin 10 times and record the number of heads. After all members of 
the class have completed the experiment (tossed a coin 10 times and 
counted the number of heads), fill in the chart using a heading like the one 
below. Let X = the number of heads in 10 tosses of the coin. 


x    Frequency of x    Relative Frequency of x


• Which value(s) of x occurred most frequently?
• If you tossed the coin 1,000 times, what values could x take on?
Which value(s) of x do you think would occur most frequently?
• What does the relative frequency column sum to?


Glossary 


Random Variable (RV) 
see Variable 


Variable (Random Variable) 
A characteristic of interest in a population being studied. Common 
notation for variables are upper case Latin letters X, Y, Z,...; common 
notation for a specific value from the domain (set of all possible values 
of a variable) are lower case Latin letters x, y, z,.... For example, if X 
is the number of children in a family, then x represents a specific 
integer 0, 1, 2, 3, .... Variables in statistics differ from variables in
intermediate algebra in the following two ways.

• The domain of the random variable (RV) is not necessarily a
numerical set; the domain may be expressed in words; for
example, if X = hair color then the domain is {black, blond, gray,
green, orange}.
• We can tell what specific value x of the Random Variable X takes
only after performing the experiment.


Probability Distribution Function (PDF) for a Discrete Random Variable 
This module introduces the Probability Distribution Function (PDF) and its 
characteristics. 


A discrete probability distribution function has two characteristics: 


e Each probability is between 0 and 1, inclusive. 
e The sum of the probabilities is 1. 


Example: 

A child psychologist is interested in the number of times a newborn baby's 
crying wakes its mother after midnight. For a random sample of 50 
mothers, the following information was obtained. Let X = the number of
times a newborn wakes its mother after midnight. For this example, x = 0,
1, 2, 3, 4, 5.

P(x) = probability that X takes on a value x.


x    P(x)
0    P(x=0) = 2/50
1    P(x=1) = 11/50
2    P(x=2) = 23/50
3    P(x=3) = 9/50
4    P(x=4) = 4/50
5    P(x=5) = 1/50


X takes on the values 0, 1, 2, 3, 4, 5. This is a discrete PDF because 


1. Each P(x) is between 0 and 1, inclusive. 
2. The sum of the probabilities is 1, that is,

Equation:

2/50 + 11/50 + 23/50 + 9/50 + 4/50 + 1/50 = 1


Example: 


Suppose Nancy has classes 3 days a week. She attends classes 3 days a
week 80% of the time, 2 days 15% of the time, 1 day 4% of the time, and
no days 1% of the time. Suppose one week is randomly selected.
Exercise: 


Problem: 

Let X = the number of days Nancy ____________________.

Solution: 

Let X = the number of days Nancy attends class per week. 
Exercise: 

Problem: X takes on what values? 

Solution: 


0, 1, 2, and 3


Exercise: 


Problem: 


Suppose one week is randomly chosen. Construct a probability 
distribution table (called a PDF table) like the one in the previous 
example. The table should have two columns labeled x and P(x). 
What does the P(x) column sum to? 


Solution: 
x P(x) 
0 0.01 
1 0.04 
2 0.15 
3 0.80 
Glossary 


Probability Distribution Function (PDF) 
A mathematical description of a discrete random variable (RV), given 
either in the form of an equation (formula) , or in the form of a table 
listing all the possible outcomes of an experiment and the probability 
associated with each outcome. 


Example: 

A biased coin with probability 0.7 for a head (in one toss of the coin) is
tossed 5 times. We are interested in the number of heads (the RV X = the
number of heads). X is Binomial, so X ~ B(5, 0.7) and
$P(X = x) = \binom{5}{x}(0.7)^{x}(0.3)^{5-x}$, or in the form of the table:

x    P(X = x)
0    0.0024
1    0.0284
2    0.1323
3    0.3087
4    0.3602
5    0.1681


Mean or Expected Value and Standard Deviation 

This module explores the Law of Large Numbers, the phenomenon where 
an experiment performed many times will yield cumulative results closer 
and closer to the theoretical mean over time. 


The expected value is often referred to as the "long-term" average or
mean. This means that over the long term of doing an experiment over and
over, you would expect this average.


The mean of a random variable X is μ. If we do an experiment many times
(for instance, flip a fair coin, as Karl Pearson did, 24,000 times and let X =
the number of heads) and record the value of X each time, the average is
likely to get closer and closer to μ as we keep repeating the experiment.
This is known as the Law of Large Numbers.
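
The Law of Large Numbers can be watched in a small simulation. The Python sketch below is an illustration and not part of the original module. It treats each flip as one trial with X = 1 for a head and X = 0 for a tail, so μ = 0.5. The function name and the seed are assumptions made for this example.

```python
import random

def running_proportion_of_heads(n_flips, seed=None):
    """Return the proportion of heads after each of n_flips simulated fair-coin flips."""
    rng = random.Random(seed)
    heads = 0
    proportions = []
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5   # True counts as 1, False as 0
        proportions.append(heads / i)
    return proportions

# 24,000 flips, the number attributed to Karl Pearson in the text
props = running_proportion_of_heads(24000, seed=0)
for n in (10, 100, 1000, 24000):
    print(n, round(props[n - 1], 4))
```

The printed proportions depend on the seed, but the later ones should generally sit closer to 0.5 than the early ones.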


Note: To find the expected value or long-term average, μ, simply multiply
each value of the random variable by its probability and add the products.


A Step-by-Step Example 

A men's soccer team plays soccer 0, 1, or 2 days a week. The probability
that they play 0 days is 0.2, the probability that they play 1 day is 0.5, and
the probability that they play 2 days is 0.3. Find the long-term average, μ,
or expected value of the days per week the men's soccer team plays soccer.

To do the problem, first let the random variable X = the number of days the
men's soccer team plays soccer per week. X takes on the values 0, 1, 2.
Construct a PDF table, adding a column xP(x). In this column, you will
multiply each x value by its probability.


x    P(x)    xP(x)
0    0.2     (0)(0.2) = 0
1    0.5     (1)(0.5) = 0.5
2    0.3     (2)(0.3) = 0.6

Expected Value Table: This table is called an expected value table. The table
helps you calculate the expected value or long-term average.


Add the last column to find the long term average or expected value: 
(0)(0.2)+(1)(0.5)+(2)(0.3)= 0 + 0.5 + 0.6 = 1.1. 


The expected value is 1.1. The men's soccer team would, on the average,
expect to play soccer 1.1 days per week. The number 1.1 is the long-term
average or expected value if the men's soccer team plays soccer week after
week after week. We say μ = 1.1.


Example: 

Find the expected value for the example about the number of times a 
newborn baby's crying wakes its mother after midnight. The expected 
value is the expected number of times a newborn wakes its mother after 
midnight. 


x    P(x)              xP(x)
0    P(x=0) = 2/50     (0)(2/50) = 0
1    P(x=1) = 11/50    (1)(11/50) = 11/50
2    P(x=2) = 23/50    (2)(23/50) = 46/50
3    P(x=3) = 9/50     (3)(9/50) = 27/50
4    P(x=4) = 4/50     (4)(4/50) = 16/50
5    P(x=5) = 1/50     (5)(1/50) = 5/50

Add the last column to find the expected value. μ = Expected Value =
105/50 = 2.1

You expect a newborn to wake its mother after midnight 2.1 times, on the
average.

Exercise: 


Problem: 


Go back and calculate the expected value for the number of days 
Nancy attends classes a week. Construct the third column to do so. 


Solution: 


2.74 days a week. 


Example: 

Suppose you play a game of chance in which five numbers are chosen
from 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. A computer randomly selects five numbers
from 0 to 9 with replacement. You pay $2 to play and could profit
$100,000 if you match all 5 numbers in order (you get your $2 back plus
$100,000). Over the long term, what is your expected profit from playing the
game?

To do this problem, set up an expected value table for the amount of money 
you can profit. 

Let X = the amount of money you profit. The values of x are not 0, 1, 2, 3, 
4, 5, 6, 7, 8, 9. Since you are interested in your profit (or loss), the values 
of x are 100,000 dollars and -2 dollars. 

To win, you must get all 5 numbers correct, in order. The probability of
choosing one correct number is 1/10 because there are 10 numbers. You may
choose a number more than once. The probability of choosing all 5
numbers correctly and in order is:

Equation:

(1/10)(1/10)(1/10)(1/10)(1/10) = 1 × 10^-5 = 0.00001


Therefore, the probability of winning is 0.00001 and the probability of 
losing is 
Equation: 


1 — 0.00001 = 0.99999 


The expected value table is as follows. 


          x          P(x)       xP(x)
Loss      -2         0.99999    (-2)(0.99999) = -1.99998
Profit    100,000    0.00001    (100000)(0.00001) = 1


Add the last column. -1.99998 + 1 = -0.99998 


Since -0.99998 is about -1, you would, on the average, expect to lose
approximately one dollar for each game you play. However, each time you
play, you either lose $2 or profit $100,000. The $1 is the average or
expected LOSS per game after playing this game over and over.
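
The expected value table for this game can be reproduced exactly with fractions, which avoids rounding. The Python sketch below is an illustration and not part of the original module; the variable names are chosen for this example.

```python
from fractions import Fraction

# Five digits must each match, each with probability 1/10
p_win = Fraction(1, 10) ** 5
p_lose = 1 - p_win

# X = profit in dollars: +100,000 on a win, -2 on a loss
expected_profit = 100_000 * p_win + (-2) * p_lose

print(float(p_win), float(p_lose), float(expected_profit))
```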


Example: 
Suppose you play a game with a biased coin. You play each game by
tossing the coin once. P(heads) = 2/3 and P(tails) = 1/3. If you toss a
head, you pay $6. If you toss a tail, you win $10. If you play this game
many times, will you come out ahead?
Exercise: 


Problem: Define a random variable X. 


Solution: 


X = amount of profit 


Exercise: 


Problem: Complete the following expected value table. 


          x       P(x)    xP(x)
WIN       10      ____    ____
LOSE      ____    ____    ____


Solution: 


          x      P(x)    xP(x)
WIN       10     1/3     10/3
LOSE      -6     2/3     -12/3


Exercise: 


Problem: What is the expected value, μ? Do you come out ahead?


Solution: 


Add the last column of the table. The expected value μ = -2/3. You
lose, on average, about 67 cents each time you play the game so you
do not come out ahead.


Like data, probability distributions have standard deviations. To calculate
the standard deviation (σ) of a probability distribution, find each deviation
from its expected value, square it, multiply it by its probability, add the
products, and take the square root. To understand how to do the calculation,
look at the table for the number of days per week a men's soccer team plays
soccer. To find the standard deviation, add the entries in the column labeled
(x - μ)² · P(x) and take the square root.


x    P(x)    xP(x)             (x - μ)²P(x)
0    0.2     (0)(0.2) = 0      (0 - 1.1)²(0.2) = 0.242
1    0.5     (1)(0.5) = 0.5    (1 - 1.1)²(0.5) = 0.005
2    0.3     (2)(0.3) = 0.6    (2 - 1.1)²(0.3) = 0.243


Add the last column in the table. 0.242 + 0.005 + 0.243 = 0.490. The
standard deviation is the square root of 0.49. σ = √0.49 = 0.7
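
The two table columns, xP(x) and (x - μ)²P(x), translate directly into a short function. The following Python sketch is an illustration and not part of the original module. The helper name mean_and_sd and the dictionary layout {x: P(x)} are assumptions made for this example.

```python
import math

def mean_and_sd(pdf):
    """Return (mu, sigma) for a discrete PDF given as a dict {x: P(x)}."""
    if abs(sum(pdf.values()) - 1) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Expected value: multiply each x by its probability and add the products
    mu = sum(x * p for x, p in pdf.items())
    # Variance: add the (x - mu)^2 * P(x) column, then take the square root
    variance = sum((x - mu) ** 2 * p for x, p in pdf.items())
    return mu, math.sqrt(variance)

# Men's soccer team: days played per week
soccer = {0: 0.2, 1: 0.5, 2: 0.3}
mu, sigma = mean_and_sd(soccer)
print(round(mu, 4), round(sigma, 4))
```

The same helper applied to Nancy's attendance table, {0: 0.01, 1: 0.04, 2: 0.15, 3: 0.80}, gives the expected value of 2.74 days a week.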


Generally for probability distributions, we use a calculator or a computer to
calculate μ and σ to reduce roundoff error. For some probability
distributions, there are short-cut formulas that calculate μ and σ.


Glossary 


Expected Value 
Expected arithmetic average when an experiment is repeated many
times. (Also called the mean). Notations: E(x), μ. For a discrete
random variable (RV) with probability distribution function P(x), the
definition can also be written in the form E(x) = μ = Σ xP(x).


Mean 
A number that measures the central tendency. A common name for
mean is ‘average.’ The term 'mean' is a shortened form of ‘arithmetic
mean.’ By definition, the mean for a sample (denoted by x̄) is
x̄ = (Sum of all values in the sample) / (Number of values in the sample),
and the mean for a population (denoted by μ) is
μ = (Sum of all values in the population) / (Number of values in the population).


Common Discrete Probability Distribution Functions 

This module serves as a lead-in for several types of common discrete 
probability distribution functions, including binomial, geometric, 
hypergeometric, and Poisson. 


Some of the more common discrete probability functions are binomial, 
geometric, hypergeometric, and Poisson. Most elementary courses do not 
cover the geometric, hypergeometric, and Poisson. Your instructor will let 
you know if he or she wishes to cover these distributions. 


A probability distribution function is a pattern. You try to fit a probability 
problem into a pattern or distribution in order to perform the necessary 
calculations. These distributions are tools to make solving probability 
problems easier. Each distribution has its own special characteristics. 
Learning the characteristics enables you to distinguish among the different 
distributions. 


Binomial 
This module describes the characteristics of a binomial experiment and the 
binomial probability distribution function. 


The characteristics of a binomial experiment are:

1. There are a fixed number of trials. Think of trials as repetitions of an
experiment. The letter n denotes the number of trials.

2. There are only 2 possible outcomes, called "success" and "failure", for
each trial. The letter p denotes the probability of a success on one trial
and q denotes the probability of a failure on one trial. p + q = 1.

3. The n trials are independent and are repeated using identical
conditions. Because the n trials are independent, the outcome of one
trial does not help in predicting the outcome of another trial. Another
way of saying this is that for each individual trial, the probability, p, of
a success and probability, q, of a failure remain the same. For example,
randomly guessing at a true-false statistics question has only two
outcomes. If a success is guessing correctly, then a failure is guessing
incorrectly. Suppose Joe always guesses correctly on any statistics
true-false question with probability p = 0.6. Then, q = 0.4. This means
that for every true-false statistics question Joe answers, his
probability of success (p = 0.6) and his probability of failure (q = 0.4)
remain the same.


The outcomes of a binomial experiment fit a binomial probability 
distribution. The random variable X = the number of successes obtained 
in the n independent trials. 


The mean, μ, and variance, σ², for the binomial probability distribution are
μ = np and σ² = npq. The standard deviation, σ, is then σ = √(npq).


Any experiment that has characteristics 2 and 3 and where n = 1 is called a
Bernoulli Trial (named after Jacob Bernoulli who, in the late 1600s, 
studied them extensively). A binomial experiment takes place when the 
number of successes is counted in one or more Bernoulli Trials. 


Example: 

At ABC College, the withdrawal rate from an elementary physics course is 
30% for any given term. This implies that, for any given term, 70% of the 
students stay in the class for the entire term. A "success" could be defined 
as an individual who withdrew. The random variable is X = the number of 
students who withdraw from the randomly selected elementary physics 
class. 


Example: 

Suppose you play a game that you can only either win or lose. The 
probability that you win any game is 55% and the probability that you lose 
is 45%. Each game you play is independent. If you play the game 20 times, 
what is the probability that you win 15 of the 20 games? Here, if you 
define X = the number of wins, then X takes on the values 0, 1, 2, 3, ..., 
20. The probability of a success is p = 0.55. The probability of a failure is 
q = 0.45. The number of trials is n = 20. The probability question can be
stated mathematically as P(x = 15).


Example: 

A fair coin is flipped 15 times. Each flip is independent. What is the 
probability of getting more than 10 heads? Let X = the number of heads in 
15 flips of the fair coin. X takes on the values 0, 1, 2, 3, ..., 15. Since the 
coin is fair, p = 0.5 and q = 0.5. The number of trials is n = 15. The 
probability question can be stated mathematically as P(x > 10).
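
The two probability questions in the game and coin examples above can be evaluated with the binomial formula P(X = x) = (number of ways to choose x of the n trials) · p^x · q^(n - x). The Python sketch below is an illustration and not part of the original module. The helper name binom_pmf is an assumption made for this example.

```python
from math import comb

def binom_pmf(n, p, x):
    """P(X = x) for X ~ B(n, p): comb(n, x) * p^x * q^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Game example: n = 20, p = 0.55, exactly 15 wins
p_15_wins = binom_pmf(20, 0.55, 15)

# Coin example: n = 15, p = 0.5, "more than 10 heads" means x = 11, ..., 15
p_more_than_10 = sum(binom_pmf(15, 0.5, x) for x in range(11, 16))

print(round(p_15_wins, 4), round(p_more_than_10, 4))
```

Note that "more than 10" excludes x = 10, so the sum starts at 11.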


Example: 

Approximately 70% of statistics students do their homework in time for it 
to be collected and graded. Each student does homework independently. In 
a Statistics class of 50 students, what is the probability that at least 40 will 
do their homework on time? Students are selected randomly. 

Exercise: 


Problem: 


This is a binomial problem because there is only a success or a
__________, there are a definite number of trials, and the probability
of a success is 0.70 for each trial.


Solution: 


failure 
Exercise: 


Problem: 


If we are interested in the number of students who do their homework, 
then how do we define X? 


Solution: 

X = the number of statistics students who do their homework on time
Exercise: 

Problem: What values does x take on?


Solution: 
0, 1, 2, ..., 50
Exercise: 


Problem: What is a "failure", in words? 


Solution: 


Failure is a student who does not do his or her homework on time. 


The probability of a success is p = 0.70. The number of trials is n = 50.
Exercise: 


Problem: If p + q = 1, then what is q?
Solution:

q = 0.30
Exercise: 
Problem: 


The words "at least" translate as what kind of inequality for the
probability question P(x ____ 40)?

Solution:

greater than or equal to (≥)

The probability question is P(x ≥ 40).


Notation for the Binomial: B = Binomial Probability 
Distribution Function 


X ~ B(n,p) 


Read this as "X is a random variable with a binomial distribution." The
parameters are n and p. n = number of trials; p = probability of a success on
each trial.


Example: 

It has been stated that about 41% of adult workers have a high school
diploma but do not pursue any further education. If 20 adult workers are
randomly selected, find the probability that at most 12 of them have a high
school diploma but do not pursue any further education. How many adult
workers do you expect to have a high school diploma but do not pursue
any further education?

Let X = the number of workers who have a high school diploma but do not 
pursue any further education. 

X takes on the values 0, 1, 2, ..., 20 where n = 20 and p = 0.41.
q = 1 - 0.41 = 0.59. X ~ B(20, 0.41)

Find P(x ≤ 12). P(x ≤ 12) = 0.9738. (calculator or computer)

Using the TI-83+ or the TI-84 calculators, the calculations are as follows.
Go into 2nd DISTR. The syntax for the instructions is as follows.

To calculate P(x = value): binompdf(n, p, number). If "number" is left out,
the result is the binomial probability table.

To calculate P(x ≤ value): binomcdf(n, p, number). If "number" is left
out, the result is the cumulative binomial probability table.

For this problem: After you are in 2nd DISTR, arrow down to
binomcdf. Press ENTER. Enter 20,.41,12). The result is
P(x ≤ 12) = 0.9738.


Note: If you want to find P(x = 12), use the pdf (binompdf). If you want
to find P(x > 12), use 1 - binomcdf(20,.41,12).


The probability that at most 12 workers have a high school diploma but do not
pursue any further education is 0.9738.

The graph of X ~ B(20, 0.41) is:

[Graph: probability histogram of X ~ B(20, 0.41), with x = 0, 1, 2, ..., 20 on the horizontal axis]

The y-axis contains the probability of x, where X = the number of workers
who have only a high school diploma.

The number of adult workers that you expect to have a high school 
diploma but not pursue any further education is the mean, 

μ = np = (20)(0.41) = 8.2

The formula for the variance is σ² = npq. The standard deviation is
σ = √(npq) = √((20)(0.41)(0.59)) = 2.20.
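
For readers working on a computer rather than a TI calculator, the same quantities can be sketched in Python. The two helper functions below are written for this example and only borrow the calculator's names binompdf and binomcdf; they are not a library API.

```python
from math import comb, sqrt

def binompdf(n, p, x):
    """P(X = x) for X ~ B(n, p) (helper named after the calculator function)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(X <= x): add the pdf values for 0, 1, ..., x."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

n, p = 20, 0.41
at_most_12 = binomcdf(n, p, 12)        # P(x <= 12)
more_than_12 = 1 - binomcdf(n, p, 12)  # P(x > 12), by the complement
mean = n * p                           # mu = np
sd = sqrt(n * p * (1 - p))             # sigma = sqrt(npq)

print(round(at_most_12, 4), round(mean, 1), round(sd, 2))
```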


Example: 

The following example illustrates a problem that is not binomial. It 
violates the condition of independence. ABC College has a student 
advisory committee made up of 10 staff members and 6 students. The 
committee wishes to choose a chairperson and a recorder. What is the 
probability that the chairperson and recorder are both students? All names 
of the committee are put into a box and two names are drawn without 
replacement. The first name drawn determines the chairperson and the 
second name the recorder. There are two trials. However, the trials are not 
independent because the outcome of the first trial affects the outcome of 
the second trial. The probability of a student on the first draw is 6/16. The
probability of a student on the second draw is 5/15 when the first draw
produces a student. The probability is 6/15 when the first draw produces a
staff member. The probability of drawing a student's name changes for
each of the trials and, therefore, violates the condition of independence.


Glossary 


Bernoulli Trials 
An experiment with the following characteristics: 


• There are only 2 possible outcomes called “success” and “failure”
for each trial.
• The probability p of a success is the same for any trial (so the
probability q = 1 - p of a failure is the same for any trial).


Binomial Distribution 
A discrete random variable (RV) which arises from Bernoulli trials. 
There are a fixed number, n, of independent trials. “Independent” 
means that the result of any trial (for example, trial 1) does not affect 
the results of the following trials, and all trials are conducted under the 
same conditions. Under these circumstances the binomial RV X is 
defined as the number of successes in n trials. The notation is: X ~
B(n, p). The mean is μ = np and the standard deviation is σ = √(npq).
The probability of exactly x successes in n trials is
$P(X = x) = \binom{n}{x} p^{x} q^{n-x}$.


Geometric (optional) 

This module describes the geometric experiment and the geometric 
probability distribution. This module is included in the Collaborative 
Statistics textbook/collection as an optional lesson. 


The characteristics of a geometric experiment are: 


1. There are one or more Bernoulli trials with all failures except the last 
one, which is a success. In other words, you keep repeating what you 
are doing until the first success. Then you stop. For example, you 
throw a dart at a bull's eye until you hit the bull's eye. The first time 
you hit the bull's eye is a "success" so you stop throwing the dart. It 
might take you 6 tries until you hit the bull's eye. You can think of the 
trials as failure, failure, failure, failure, failure, success. STOP. 

2. In theory, the number of trials could go on forever. There must be at
least one trial.

3. The probability, p, of a success and the probability, q, of a failure are the
same for each trial. p + q = 1 and q = 1 - p. For example, the
probability of rolling a 3 when you throw one fair die is 1/6. This is true
no matter how many times you roll the die. Suppose you want to know
the probability of getting the first 3 on the fifth roll. On rolls 1, 2, 3,
and 4, you do not get a face with a 3. The probability for each of rolls
1, 2, 3, and 4 is q = 5/6, the probability of a failure. The probability of
getting the first 3 on the fifth roll is (5/6)(5/6)(5/6)(5/6)(1/6) = 0.0804.


X = the number of independent trials until the first success. The mean and
variance are in the summary in this chapter.


Example: 
You play a game of chance that you can either win or lose (there are no
other possibilities) until you lose. Your probability of losing is p = 0.57.
What is the probability that it takes 5 games until you lose? Let X = the
number of games you play until you lose (includes the losing game). Then
X takes on the values 1, 2, 3, ... (could go on indefinitely). The probability
question is P(x = 5).


Example: 

A safety engineer feels that 35% of all industrial accidents in her plant are 
caused by failure of employees to follow instructions. She decides to look 
at the accident reports (selected randomly and replaced in the pile after 
reading) until she finds one that shows an accident caused by failure of 
employees to follow instructions. On the average, how many reports would 
the safety engineer expect to look at until she finds a report showing an 
accident caused by employee failure to follow instructions? What is the 
probability that the safety engineer will have to examine at least 3 reports 
until she finds a report showing an accident caused by employee failure to 
follow instructions? 

Let X = the number of accident reports the safety engineer must examine until
she finds a report showing an accident caused by employee failure to
follow instructions. X takes on the values 1, 2, 3, .... The first question
asks you to find the expected value or the mean. The second question asks
you to find P(x ≥ 3). ("At least" translates as a "greater than or equal to"
symbol).
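
Both questions in the safety engineer example can be answered in a few lines. The Python sketch below is an illustration and not part of the original module. It uses the geometric mean formula μ = 1/p, and the fact that needing at least 3 reports is the same as the first 2 reports both being failures.

```python
p = 0.35      # probability a report shows failure to follow instructions
q = 1 - p     # probability a report does not

# Mean of a geometric distribution: mu = 1/p
expected_reports = 1 / p

# P(x >= 3): the first two reports examined are both "failures"
p_at_least_3 = q ** 2

print(round(expected_reports, 2), round(p_at_least_3, 4))
```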


Example: 

Suppose that you are looking for a student at your college who lives within 
five miles of you. You know that 55% of the 25,000 students do live within 
five miles of you. You randomly contact students from the college until 
one says he/she lives within five miles of you. What is the probability that 
you need to contact four people? 

This is a geometric problem because you may have a number of failures 
before you have the one success you desire. Also, the probability of a 
success stays the same each time you ask a student if he/she lives within 
five miles of you. There is no definite number of trials (number of times 
you ask a student). 

Exercise: 


Problem: 


Let X = the number of ____________ you must ask ____________
one says yes.


Solution: 


Let X = the number of students you must ask until one says yes. 


Exercise: 


Problem: What values does X take on? 


Solution: 


1, 2, 3, ..., (total number of students) 


Exercise: 


Problem: What are p and q? 


Solution: 
• p = 0.55
• q = 0.45
Exercise:
Problem: The probability question is P(_______).
Solution:
P(x = 4)


Notation for the Geometric: G = Geometric Probability 
Distribution Function 


X ~ G(p)


Read this as "X is a random variable with a geometric distribution." The
parameter is p. p = the probability of a success for each trial.


Example: 

Assume that the probability of a defective computer component is 0.02. 
Components are randomly selected. Find the probability that the first 
defect is caused by the 7th component tested. How many components do 
you expect to test until one is found to be defective? 

Let X = the number of computer components tested until the first defect is 
found. 

X takes on the values 1, 2, 3, ... where p = 0.02. X ~ G(0.02) 

Find P(x = 7). P(x = 7) = 0.0177. (calculator or computer)

TI-83+ and TI-84: For a general discussion, see this example (binomial).
The syntax is similar. The geometric parameter list is (p, number). If
"number" is left out, the result is the geometric probability table. For this
problem: After you are in 2nd DISTR, arrow down to D:geometpdf.
Press ENTER. Enter .02,7). The result is P(x = 7) = 0.0177.

The probability that the 7th component is the first defect is 0.0177. 

The graph of X ~ G(0.02) is: 


[Graph: P(X = x) versus x for X ~ G(0.02)]


The y-axis contains the probability of x, where X = the number of 
computer components tested. 
The number of components that you would expect to test until you find the
first defective one is the mean, μ = 50.

The formula for the mean is μ = 1/p = 1/0.02 = 50

The formula for the variance is

σ² = (1/p)(1/p - 1) = (1/0.02)(1/0.02 - 1) = 2450

The standard deviation is

σ = √((1/p)(1/p - 1)) = √((1/0.02)(1/0.02 - 1)) = 49.5
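
The calculator result and the mean, variance, and standard deviation formulas above can be reproduced with a few lines of Python. The sketch below is an illustration and not part of the original module. The helper geometpdf is written for this example and only borrows the calculator's name.

```python
from math import sqrt

def geometpdf(p, x):
    """P(X = x) for X ~ G(p): x - 1 failures followed by the first success."""
    return (1 - p) ** (x - 1) * p

p = 0.02                               # probability a component is defective
p_first_defect_on_7th = geometpdf(p, 7)
mean = 1 / p                           # mu = 1/p
variance = (1 / p) * (1 / p - 1)       # sigma^2 = (1/p)(1/p - 1)
sd = sqrt(variance)

print(round(p_first_defect_on_7th, 4), mean, variance, round(sd, 1))
```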


Glossary 


Geometric Distribution 
A discrete random variable (RV) which arises from the Bernoulli trials.
The trials are repeated until the first success. The geometric variable X
is defined as the number of trials until the first success. Notation: X ~
G(p). The mean is μ = 1/p and the standard deviation is
σ = √((1/p)(1/p - 1)). The probability that the first success occurs on
trial x (that is, after exactly x - 1 failures) is given by the formula:
P(X = x) = p(1 - p)^(x-1).


Hypergeometric (optional) 

This module describes the properties of a hypergeometric experiment and 
hypergeometric probability distribution. This module is included in the 
Collaborative Statistics textbook/collection as an optional lesson. 


The characteristics of a hypergeometric experiment are: 


1. You take samples from 2 groups. 

2. You are concerned with a group of interest, called the first group. 

3. You sample without replacement from the combined groups. For 
example, you want to choose a softball team from a combined group of 
11 men and 13 women. The team consists of 10 players. 

4. Each pick is not independent, since sampling is without replacement.
In the softball example, the probability of picking a woman first is $\frac{13}{24}$.
The probability of picking a man second is $\frac{11}{23}$ if a woman was picked
first. It is $\frac{10}{23}$ if a man was picked first. The probability of the second
pick depends on what happened in the first pick.
5. You are not dealing with Bernoulli Trials. 
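
The dependence between picks described in item 4 can be made concrete with exact fractions. The following is a minimal sketch using the softball numbers from the list above (11 men, 13 women); the variable names are illustrative only.

```python
from fractions import Fraction

men, women = 11, 13
total = men + women                       # 24 people in the combined group

# First pick: a woman.
p_woman_first = Fraction(women, total)

# Second pick: a man. One person is already gone, so 23 remain,
# and the number of men left depends on who was picked first.
p_man_second_given_woman_first = Fraction(men, total - 1)
p_man_second_given_man_first = Fraction(men - 1, total - 1)

print(p_woman_first)                      # 13/24
print(p_man_second_given_woman_first)     # 11/23
print(p_man_second_given_man_first)       # 10/23
```

Because the two conditional probabilities differ, the picks are not independent, which is why the trials are not Bernoulli trials.
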


The outcomes of a hypergeometric experiment fit a hypergeometric 
probability distribution. The random variable X = the number of items 
from the group of interest. The mean and variance are given in the 
summary. 


Example: 

A candy dish contains 100 jelly beans and 80 gumdrops. Fifty candies are 
picked at random. What is the probability that 35 of the 50 are gumdrops? 
The two groups are jelly beans and gumdrops. Since the probability 
question asks for the probability of picking gumdrops, the group of interest 
(first group) is gumdrops. The size of the group of interest (first group) is 
80. The size of the second group is 100. The size of the sample is 50 (jelly 
beans or gumdrops). Let X = the number of gumdrops in the sample of 50. 
X takes on the values x = 0, 1, 2, ..., 50. The probability question is
P(x = 35).


Example: 

Suppose a shipment of 100 VCRs is known to have 10 defective VCRs. An 
inspector randomly chooses 12 for inspection. He is interested in 
determining the probability that, among the 12, at most 2 are defective. 
The two groups are the 90 non-defective VCRs and the 10 defective VCRs. 
The group of interest (first group) is the defective group because the 
probability question asks for the probability of at most 2 defective VCRs. 
The size of the sample is 12 VCRs. (They may be non-defective or 
defective.) Let X = the number of defective VCRs in the sample of 12. X 
takes on the values 0, 1, 2, ..., 10. X may not take on the values 11 or 12. 
The sample size is 12, but there are only 10 defective VCRs. The inspector 
wants to know P(x ≤ 2) ("At most" means "less than or equal to").


Example: 

You are president of an on-campus special events organization. You need a 
committee of 7 to plan a special birthday party for the president of the 
college. Your organization consists of 18 women and 15 men. You are 
interested in the number of men on your committee. If the members of the 
committee are randomly selected, what is the probability that your 
committee has more than 4 men? 

This is a hypergeometric problem because you are choosing your 
committee from two groups (men and women). 

Exercise: 


Problem: Are you choosing with or without replacement? 


Solution: 


Without 


Exercise: 


Problem: What is the group of interest? 


Solution: 


The men 

Exercise: 
Problem: How many are in the group of interest? 
Solution: 
15 men 

Exercise: 
Problem: How many are in the other group? 
Solution: 


18 women 
Exercise: 


Problem: 

Let X = _______ on the committee. What values does X take on?

Solution: 

Let X = the number of men on the committee. x = 0, 1, 2, ..., 7. 
Exercise: 

Problem: The probability question is P(___).


Solution: 


P(x>4) 


Notation for the Hypergeometric: H = Hypergeometric 
Probability Distribution Function 


X~H(r, b, n) 


Read this as "X is a random variable with a hypergeometric distribution."
The parameters are r, b, and n. r = the size of the group of interest (first
group), b = the size of the second group, n = the size of the chosen sample.


Example: 

A school site committee is to be chosen randomly from 6 men and 5 
women. If the committee consists of 4 members chosen randomly, what is 
the probability that 2 of them are men? How many men do you expect to 
be on the committee? 

Let X = the number of men on the committee of 4. The men are the group 
of interest (first group). 

X takes on the values 0, 1, 2, 3, 4, where r = 6, b = 5, and n = 4.
X~H(6, 5, 4) 

Find P(x = 2). P(x = 2) = 0.4545 (calculator or computer) 


Note: Currently, the TI-83+ and TI-84 do not have hypergeometric
probability functions. There are a number of computer packages, 
including Microsoft Excel, that do. 


The probability that there are 2 men on the committee is about 0.45. 
The graph of X~H(6, 5, 4) is: 


[Graph not reproduced: P(X = x) on the vertical axis.]


The y-axis contains the probability of X, where X = the number of men on
the committee.
You would expect μ = 2.18 (about 2) men on the committee.


The formula for the mean is $\mu = \frac{n \cdot r}{r + b} = \frac{4 \cdot 6}{6 + 5} = 2.18$
The formula for the variance is fairly complex. You will find it in the
Summary of the Discrete Probability Functions Chapter.
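
Since the TI calculators lack a hypergeometric function, the committee example can be checked with a short Python sketch. Note the assumptions: the counting formula used here (combinations of r choose x times b choose n - x, divided by r + b choose n) is the standard hypergeometric formula and is not stated in this section, and the helper name hypergeometric_pmf is made up for illustration.

```python
from math import comb

def hypergeometric_pmf(r, b, n, x):
    """P(X = x) for X ~ H(r, b, n), using the standard counting formula."""
    return comb(r, x) * comb(b, n - x) / comb(r + b, n)

r, b, n = 6, 5, 4                 # 6 men, 5 women, committee of 4
p_two_men = hypergeometric_pmf(r, b, n, 2)
mean = n * r / (r + b)

print(round(p_two_men, 4))        # 0.4545
print(round(mean, 2))             # 2.18
```
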


Glossary 


Hypergeometric Distribution 
A discrete random variable (RV) that is characterized by 


• A fixed number of trials.
• The probability of success is not the same from trial to trial.


We sample from two groups of items when we are interested in only
one group. X is defined as the number of successes out of the total
number of items chosen. Notation: X ~ H(r, b, n), where r = the
number of items in the group of interest, b = the number of items in the
group not of interest, and n = the number of items chosen.


Poisson 

This module describes the characteristics of a Poisson experiment and the 
Poisson probability distribution. This module is included in the Elementary 
Statistics textbook/collection as an optional lesson. 


Characteristics of a Poisson experiment: 


1. The Poisson gives the probability of a number of events occurring in a 
fixed interval of time or space if these events happen with a known 
average rate and independently of the time since the last event. For 
example, a book editor might be interested in the number of words 
spelled incorrectly in a particular book. It might be that, on the 
average, there are 5 words spelled incorrectly in 100 pages. The 
interval is the 100 pages. 

2. The Poisson may be used to approximate the binomial if the 
probability of success is "small" (such as 0.01) and the number of trials 
is "large" (such as 1000). You will verify the relationship in the 
homework exercises. n is the number of trials and p is the probability 
of a "success." 


Poisson probability distribution. The random variable X = the number of 
occurrences in the interval of interest. The mean and variance are given in 
the summary. 


Example: 

The average number of loaves of bread put on a shelf in a bakery in a half- 
hour period is 12. Of interest is the number of loaves of bread put on the 
shelf in 5 minutes. The time interval of interest is 5 minutes. What is the 
probability that the number of loaves, selected randomly, put on the shelf 
in 5 minutes is 3? 

Let X = the number of loaves of bread put on the shelf in 5 minutes. If the 
average number of loaves put on the shelf in 30 minutes (half-hour) is 12, 
then the average number of loaves put on the shelf in 5 minutes is 
$\left(\frac{5}{30}\right) \cdot 12 = 2$ loaves of bread.


The probability question asks you to find P(x = 3). 
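
The rescaling of the mean and the resulting probability can be sketched in Python. The example above stops at stating the probability question, so the numerical value below is computed here and is not taken from the text. The formula P(X = x) = e^(-μ) μ^x / x! is the standard Poisson formula, and poisson_pmf is a hypothetical helper name.

```python
from math import exp, factorial

def poisson_pmf(mu, x):
    """P(X = x) for X ~ P(mu), using the standard Poisson formula."""
    return exp(-mu) * mu ** x / factorial(x)

loaves_per_30_min = 12
mu_5_min = loaves_per_30_min * 5 / 30     # rescale the mean to a 5-minute interval

p_three_loaves = poisson_pmf(mu_5_min, 3)

print(mu_5_min)                           # 2.0
print(round(p_three_loaves, 4))           # 0.1804
```
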


Example: 

A certain bank expects to receive 6 bad checks per day, on average. What 
is the probability of the bank getting fewer than 5 bad checks on any given 
day? Of interest is the number of checks the bank receives in 1 day, so the 
time interval of interest is 1 day. Let X = the number of bad checks the 
bank receives in one day. If the bank expects to receive 6 bad checks per 
day then the average is 6 checks per day. The probability question asks for
P(x < 5).


Example: 

You notice that a news reporter says "uh", on average, 2 times per 
broadcast. What is the probability that the news reporter says "uh" more 
than 2 times per broadcast?

This is a Poisson problem because you are interested in knowing the 
number of times the news reporter says "uh" during a broadcast. 
Exercise: 


Problem: What is the interval of interest? 
Solution: 


One broadcast 
Exercise: 


Problem: 


What is the average number of times the news reporter says "uh" 
during one broadcast? 


Solution: 


2 


Exercise: 


Problem: Let X = _______. What values does X take on?
Solution: 


Let X = the number of times the news reporter says "uh" during 
one broadcast. 
x = 0, 1, 2, 3, ...


Exercise: 


Problem: The probability question is P(____). 


Solution: 


P(x > 2)


Notation for the Poisson: P = Poisson Probability Distribution 
Function 


X ~ P(μ)


Read this as "X is a random variable with a Poisson distribution." The
parameter is μ (or λ). μ (or λ) = the mean for the interval of interest.


Example: 

Leah's answering machine receives about 6 telephone calls between 8 a.m. 
and 10 a.m. What is the probability that Leah receives more than 1 call in 

the next 15 minutes? 

Let X = the number of calls Leah receives in 15 minutes. (The interval of
interest is 15 minutes or $\frac{1}{4}$ hour.)

x = 0, 1, 2, 3, ...


If Leah receives, on the average, 6 telephone calls in 2 hours, and there are
eight 15-minute intervals in 2 hours, then Leah receives

$\frac{1}{8} \cdot 6 = 0.75$

calls in 15 minutes, on the average. So, μ = 0.75 for this problem.

X ~ P(0.75)

Find P(x > 1). P(x > 1) = 0.1734 (calculator or computer) 

TI-83+ and TI-84: For a general discussion, see this example (Binomial). 
The syntax is similar. The Poisson parameter list is (μ for the interval of
interest, number). For this problem:

Press 1- and then press 2nd DISTR. Arrow down to C:poissoncdf.
Press ENTER. Enter .75,1). The result is P(x > 1) = 0.1734. NOTE:
The TI calculators use λ (lambda) for the mean.

The probability that Leah receives more than 1 telephone call in the next 
fifteen minutes is about 0.1734. 
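
The calculator steps compute 1 - P(x ≤ 1). The same calculation can be sketched in Python, assuming the standard Poisson formula P(X = x) = e^(-μ) μ^x / x!; the helper names are illustrative, not part of any calculator or library.

```python
from math import exp, factorial

def poisson_pmf(mu, x):
    """P(X = x) for X ~ P(mu)."""
    return exp(-mu) * mu ** x / factorial(x)

def poisson_cdf(mu, x):
    """P(X <= x) for X ~ P(mu), summing the pmf from 0 through x."""
    return sum(poisson_pmf(mu, k) for k in range(x + 1))

mu = 6 / 8                                # 6 calls in 2 hours = eight 15-minute intervals
p_more_than_1 = 1 - poisson_cdf(mu, 1)    # complement of "0 or 1 calls"

print(mu)                                 # 0.75
print(round(p_more_than_1, 4))            # 0.1734
```
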

The graph of X ~ P(0.75) is: 


[Graph not reproduced: P(X = x) on the vertical axis.]


The y-axis contains the probability of x where X = the number of calls in
15 minutes.


Glossary 


Poisson Distribution 
A discrete random variable (RV) that counts the number of times a 
certain event will occur in a specific interval. Characteristics of the 
variable: 


• The probability that the event occurs in a given interval is the
same for all intervals.

• The events occur with a known mean and independently of the
time since the last event.


The distribution is defined by the mean μ of the event in the interval.
Notation: X ~ P(μ). The standard deviation is $\sigma = \sqrt{\mu}$. The
probability of exactly x occurrences in the interval is
$P(X = x) = \frac{e^{-\mu}\mu^{x}}{x!}$. The Poisson distribution is often used to
approximate the binomial distribution when n is “large” and p is
“small” (a general rule is that n should be greater than or equal to 20
and p should be less than or equal to 0.05); in that case the mean is
μ = np.


Summary of Functions 

This module provides a review of the binomial, geometric, hypergeometric, 
and Poisson probability distribution functions and their properties. 
Formula: Binomial

X ~ B(n, p)

X = the number of successes in n independent trials
n = the number of independent trials

X takes on the values x = 0, 1, 2, 3, ..., n

p = the probability of a success for any trial

q = the probability of a failure for any trial
p + q = 1, q = 1 - p


The mean is μ = np. The standard deviation is $\sigma = \sqrt{npq}$.


Formula: Geometric

X ~ G(p)

X = the number of independent trials until the first success (count the
failures and the first success)

X takes on the values x = 1, 2, 3, ...

p = the probability of a success for any trial
q = the probability of a failure for any trial
p + q = 1

q = 1 - p

The mean is $\mu = \frac{1}{p}$

The standard deviation is $\sigma = \sqrt{\frac{1}{p}\left(\frac{1}{p} - 1\right)}$


Formula: Hypergeometric

X ~ H(r, b, n)

X = the number of items from the group of interest that are in the chosen
sample.

X may take on the values x = 0, 1, ..., up to the size of the group of interest.
(The minimum value for X may be larger than 0 in some instances.)

r = the size of the group of interest (first group)
b = the size of the second group
n = the size of the chosen sample.

n ≤ r + b

The mean is: $\mu = \frac{nr}{r + b}$

The standard deviation is: $\sigma = \sqrt{\frac{rbn(r + b - n)}{(r + b)^{2}(r + b - 1)}}$


Formula: Poisson

X ~ P(μ)

X = the number of occurrences in the interval of interest
X takes on the values x = 0, 1, 2, 3, ...

The mean μ is typically given. (λ is often used as the mean instead of μ.)
When the Poisson is used to approximate the binomial, we use the binomial
mean μ = np. n is the binomial number of trials. p = the probability of a
success for each trial. This formula is valid when n is "large" and p "small"
(a general rule is that n should be greater than or equal to 20 and p should
be less than or equal to 0.05). If n is large enough and p is small enough
then the Poisson approximates the binomial very well. The variance is
σ² = μ and the standard deviation is $\sigma = \sqrt{\mu}$.
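
The claim that the Poisson approximates the binomial for large n and small p can be explored numerically. The sketch below picks n = 1000 and p = 0.01 as an illustrative choice satisfying the rule of thumb. It assumes the standard binomial and Poisson probability formulas, and the helper names are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(n, p, x):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(mu, x):
    """P(X = x) for X ~ P(mu)."""
    return exp(-mu) * mu ** x / factorial(x)

n, p = 1000, 0.01             # "large" n and "small" p (illustrative values)
mu = n * p                    # binomial mean, used as the Poisson mean

# Print both probabilities side by side for a few values of x.
for x in (5, 10, 15):
    print(x, round(binomial_pmf(n, p, x), 4), round(poisson_pmf(mu, x), 4))

# Largest pointwise difference between the two distributions for x = 0..30.
largest_gap = max(
    abs(binomial_pmf(n, p, x) - poisson_pmf(mu, x)) for x in range(31)
)
print(largest_gap)
```

Rerunning with a smaller n or a larger p (for example n = 20, p = 0.4) shows the gap growing, which is one way to see why the rule of thumb matters.
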


Practice 1: Discrete Distribution 

This module provides students an opportunity to practice applying concepts 
related to discrete distributions. This practice exercise asks students to 
calculate several values based on the data provided. 


Student Learning Outcomes 


e The student will analyze the properties of a discrete distribution. 


Given: 


A ballet instructor is interested in knowing what percent of each year's class 
will continue on to the next, so that she can plan what classes to offer. Over 
the years, she has established the following probability distribution. 


• Let X = the number of years a student will study ballet with the
teacher.
• Let P(x) = the probability that a student will study ballet x years.


Organize the Data 


Complete the table below using the data provided. 


x    P(x)    x*P(x)
1    0.10
2    0.05
3    0.10
4
5    0.30
6    0.20
7    0.10

Exercise: 


Problem: In words, define the Random Variable X.


Exercise: 


Problem: 


Exercise: 


Problem: 
Exercise: 


Problem: 
On average, how many years would you expect a child to study ballet 
with this teacher? 

Discussion Question 

Exercise: 


Problem: What does the column "P(x)" sum to and why?


Exercise:


Problem: What does the column "x*P(x)" sum to and why?


Practice 2: Binomial Distribution 

This module provides a practice of Binomial Distribution as a part of 
Collaborative Statistics collection (col10522) by Barbara Illowsky and 
Susan Dean. 


Student Learning Outcomes 


e The student will construct the Binomial Distribution. 


Given 


The Higher Education Research Institute at UCLA collected data from 
203,967 incoming first-time, full-time freshmen from 270 four-year 
colleges and universities in the U.S. 71.3% of those students replied that, 
yes, they believe that same-sex couples should have the right to legal 
marital status. (Source:
http://heri.ucla.edu/PDFs/pubs/TFS/Norms/Monographs/TheAmericanFreshman2011.pdf)


Suppose that you randomly pick 8 first-time, full-time freshmen from the
survey. You are interested in the number who believe that same-sex couples
should have the right to legal marital status.


Interpret the Data 


Exercise: 


Problem: In words, define the random Variable X. 


Solution: 


X= the number that reply “yes” 


Exercise: 


Problem: X~ 


Solution: 
B(8,0.713) 


Exercise: 


Problem: What values does the random variable X take on? 


Solution: 
0, 1, 2, 3, 4, 5, 6, 7, 8


Exercise: 


Problem: Construct the probability distribution function (PDF). 


Exercise: 


Problem: 


On average (μ), how many would you expect to answer yes?


Solution: 


5.7


Exercise: 


Problem: What is the standard deviation (σ)?


Solution: 


1.28 
Exercise: 


Problem: 


What is the probability that at most 5 of the freshmen reply “yes”? 


Solution: 


0.4151 
Exercise: 


Problem: 
What is the probability that at least 2 of the freshmen reply “yes”? 
Solution: 


0.9990 
Exercise: 


Problem: 


Construct a histogram or plot a line graph. Label the horizontal and 
vertical axes with words. Include numerical scaling. 


Practice 3: Poisson Distribution 
This module provides further practices and exercises on Poisson 
Distribution in statistics. 


Student Learning Outcomes 


e The student will analyze the properties of a Poisson distribution. 


Given 


On average, eight teens in the U.S. die from motor vehicle injuries per day. 
As a result, states across the country are debating raising the driving age. 
(Source:
http://www.cdc.gov/Motorvehiclesafety/Teen_Drivers/teendrivers_factsheet.html)


Interpret the Data 


Exercise: 


Problem: 


Assume the event occurs independently in any given day. In words, 
define the Random Variable X.


Exercise: 


Problem: X ~


Solution: 


P(8) 


Exercise: 


Problem: What values does X take on?


Solution: 


0, 1, 2, 3, 4, ...
Exercise: 


Problem: 


For the given values of the random variable X, fill in the
corresponding probabilities. 


Exercise: 


Problem: 


Is it likely that there will be no teens killed in the U.S. from motor 
vehicle injuries on any given day? Justify your answer numerically. 


Solution: 


No 
Exercise: 


Problem: 


Is it likely that there will be more than 20 teens killed in the U.S. from 
motor vehicle injuries on any given day? Justify your answer 
numerically. 


Solution: 


No 


Practice 4: Geometric Distribution 
This module provides further practice with topics of Geometric Distribution 
in Statistics. 


Student Learning Outcomes 


e The student will analyze the properties of a geometric distribution. 


Given: 
Use the information from the Binomial Distribution Practice shown below. 


The Higher Education Research Institute at UCLA collected data from 
203,967 incoming first-time, full-time freshmen from 270 four-year 
colleges and universities in the U.S. 71.3% of those students replied that, 
yes, they believe that same-sex couples should have the right to legal 
marital status. (Source:
http://heri.ucla.edu/PDFs/pubs/TFS/Norms/Monographs/TheAmericanFreshman2011.pdf)


Suppose that you randomly select freshmen from the study until you find
one who replies “yes.” You are interested in the number of freshmen you
must ask.


Interpret the Data 


Exercise: 


Problem: In words, define the Random Variable X. 


Exercise: 


Problem: X ~ 


Solution: 


G(0.713) 


Exercise: 


Problem: What values does the random variable X take on? 


Solution: 


1, 2, 3, ...
Exercise: 


Problem: 


Construct the probability distribution function (PDF). Stop at x = 6.


Exercise: 


Problem: 


On average (μ), how many freshmen would you expect to have to ask
until you found one who replies "yes?" 


Solution: 


1.4
Exercise: 


Problem: 


What is the probability that you will need to ask fewer than 3 
freshmen? 


Solution: 


0.9176 
Exercise: 


Problem: 


Construct a histogram or plot a line graph. Label the horizontal and 
vertical axes with words. Include numerical scaling. 


Practice 5: Hypergeometric Distribution 
This module provides further practice and exercises on Hypergeometric 
Distribution in statistics 


Student Learning Outcomes 


e The student will analyze the properties of a hypergeometric 
distribution. 


Given 
Suppose that a group of statistics students is divided into two groups:
business majors and non-business majors. There are 16 business majors in
the group and 7 non-business majors in the group. A random sample of 9
students is taken. We are interested in the number of business majors in the
group.
Interpret the Data 


Exercise: 


Problem: In words, define the Random Variable X. 


Exercise: 


Problem: X ~ 
Solution: 


H(16,7,9) 


Exercise: 


Problem: What values does X take on? 


Solution: 


2,3,4,5,6,7,8,9 
Exercise: 


Problem: 


Construct the probability distribution function (PDF) for X.


Exercise: 


Problem: 
On average (μ), how many would you expect to be business majors?
Solution: 


6.26 


Homework 

Discrete Random Variables: Homework is part of the collection col10555
written by Barbara Illowsky and Susan Dean, with contributions from
Roberta Bloom. It provides a number of homework exercises related to
Discrete Random Variables (binomial, geometric, hypergeometric and
Poisson).

Exercise: 


Problem: 1. Complete the PDF and answer the questions. 


x    P(X = x)    x*P(X = x)
0    0.3
1    0.2
2
3    0.4


• a. Find the probability that x = 2.
• b. Find the expected value.


Solution:


• a. 0.1
• b. 1.6


Exercise: 


Problem: 


Suppose that you are offered the following “deal.” You roll a die. If 
you roll a 6, you win $10. If you roll a 4 or 5, you win $5. If you roll a 
1, 2, or 3, you pay $6. 


• a. What are you ultimately interested in here (the value of the roll
or the money you win)?
• b. In words, define the Random Variable X.
• c. List the values that X may take on.
• d. Construct a PDF.
• e. Over the long run of playing this game, what are your expected
average winnings per game?
• f. Based on numerical values, should you take the deal? Explain
your decision in complete sentences.


Exercise: 


Problem: 


A venture capitalist, willing to invest $1,000,000, has three 
investments to choose from. The first investment, a software company, 
has a 10% chance of returning $5,000,000 profit, a 30% chance of 
returning $1,000,000 profit, and a 60% chance of losing the million 
dollars. The second company, a hardware company, has a 20% chance 
of returning $3,000,000 profit, a 40% chance of returning $1,000,000 
profit, and a 40% chance of losing the million dollars. The third 
company, a biotech firm, has a 10% chance of returning $6,000,000 
profit, a 70% chance of no profit or loss, and a 20% chance of losing the
million dollars. 


• a. Construct a PDF for each investment.
• b. Find the expected value for each investment.
• c. Which is the safest investment? Why do you think so?
• d. Which is the riskiest investment? Why do you think so?
• e. Which investment has the highest expected return, on average?


Solution:


• b. $200,000; $600,000; $400,000
• c. third investment
• d. first investment
• e. second investment
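
The expected values in part b follow from multiplying each profit by its probability and summing. A minimal Python sketch of that arithmetic is shown below. The dictionary layout and the name expected_value are illustrative choices, and a loss of the million dollars is represented as a profit of -1,000,000.

```python
from fractions import Fraction

# Each investment is a list of (probability, profit in dollars) pairs.
investments = {
    "software": [(Fraction(10, 100), 5_000_000),
                 (Fraction(30, 100), 1_000_000),
                 (Fraction(60, 100), -1_000_000)],
    "hardware": [(Fraction(20, 100), 3_000_000),
                 (Fraction(40, 100), 1_000_000),
                 (Fraction(40, 100), -1_000_000)],
    "biotech":  [(Fraction(10, 100), 6_000_000),
                 (Fraction(70, 100), 0),
                 (Fraction(20, 100), -1_000_000)],
}

def expected_value(pdf):
    """Sum of profit times probability over every outcome in the PDF."""
    return sum(prob * profit for prob, profit in pdf)

expected = {name: expected_value(pdf) for name, pdf in investments.items()}

for name, value in expected.items():
    print(name, value)
# software 200000
# hardware 600000
# biotech 400000
```

Exact fractions are used so that the probabilities in each PDF sum to exactly 1 and the results are whole dollar amounts.
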


Exercise: 


Problem: 


A theater group holds a fund-raiser. It sells 100 raffle tickets for $5 
apiece. Suppose you purchase 4 tickets. The prize is 2 passes to a 
Broadway show, worth a total of $150. 


• a. What are you interested in here?
• b. In words, define the Random Variable X.
• c. List the values that X may take on.
• d. Construct a PDF.
• e. If this fund-raiser is repeated often and you always purchase 4
tickets, what would be your expected average winnings per raffle?


Exercise: 


Problem: 


Suppose that 20,000 married adults in the United States were randomly 
surveyed as to the number of children they have. The results are 
compiled and are used as theoretical probabilities. Let X = the number 
of children 


0 0.10 
1 0.20 
2 0.30 
3 

4 0.10 
5 0.05 
6 (or more) 0.05 


• a. Find the probability that a married adult has 3 children.
• b. In words, what does the expected value in this example
represent?
• c. Find the expected value.
• d. Is it more likely that a married adult will have 2 - 3 children or
4 - 6 children? How do you know?


Solution:


• a. 0.2
• c. 2.35
• d. 2 - 3 children


Exercise: 


Problem: 


Suppose that the PDF for the number of years it takes to earn a 
Bachelor of Science (B.S.) degree is given below. 


3 0.05 
4 0.40 
5 0.30
6 0.15 
7 0.10 


• a. In words, define the Random Variable X.
• b. What does it mean that the values 0, 1, and 2 are not included
for x in the PDF?
• c. On average, how many years do you expect it to take for an
individual to earn a B.S.?


For each problem: 


• a. In words, define the Random Variable X.
• b. List the values that X may take on.
• c. Give the distribution of X. X ~


Then, answer the questions specific to each individual problem. 
Exercise: 
Problem: 


Six different colored dice are rolled. Of interest is the number of dice 
that show a “1.” 


• d. On average, how many dice would you expect to show a “1”?
• e. Find the probability that all six dice show a “1.”
• f. Is it more likely that 3 or that 4 dice will show a “1”? Use
numbers to justify your answer numerically.


Solution:


• a. X = the number of dice that show a 1
• b. 0, 1, 2, 3, 4, 5, 6
• c. X ~ B(6, $\frac{1}{6}$)
• d. 1
• e. 0.00002
• f. 3 dice


Exercise: 


Problem: 


More than 96 percent of the very largest colleges and universities 
(more than 15,000 total enrollments) have some online offerings. 
Suppose you randomly pick 13 such institutions. We are interested in 
the number that offer distance learning courses. (Source: 
http://en.wikipedia.org/wiki/Distance_education) 


• d. On average, how many schools would you expect to offer such
courses?
• e. Find the probability that at most 6 offer such courses.
• f. Is it more likely that 0 or that 13 will offer such courses? Use
numbers to justify your answer numerically and answer in a
complete sentence.


Exercise: 


Problem: 


A school newspaper reporter decides to randomly survey 12 students 
to see if they will attend Tet (Vietnamese New Year) festivities this 
year. Based on past years, she knows that 18% of students attend Tet 
festivities. We are interested in the number of students who will attend 
the festivities. 


• d. How many of the 12 students do we expect to attend the
festivities?
• e. Find the probability that at most 4 students will attend.
• f. Find the probability that more than 2 students will attend.


Solution:


• a. X = the number of students that will attend Tet.
• b. 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
• c. X ~ B(12, 0.18)
• d. 2.16
• e. 0.9511
• f. 0.3702
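
The binomial answers in this solution can be checked with a short Python sketch. It assumes the standard binomial probability formula (n choose x times p^x times q^(n - x)), which is not written out in this section, and the helper names are made up for illustration.

```python
from math import comb

def binomial_pmf(n, p, x):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(n, p, x):
    """P(X <= x) for X ~ B(n, p), summing the pmf from 0 through x."""
    return sum(binomial_pmf(n, p, k) for k in range(x + 1))

n, p = 12, 0.18                           # 12 students, 18% attend Tet
expected_attendees = n * p
p_at_most_4 = binomial_cdf(n, p, 4)
p_more_than_2 = 1 - binomial_cdf(n, p, 2)

print(round(expected_attendees, 2))       # 2.16
print(round(p_at_most_4, 4))              # 0.9511
print(round(p_more_than_2, 4))            # 0.3702
```
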


Exercise: 
Problem: 


Suppose that about 85% of graduating students attend their graduation. 
A group of 22 graduating students is randomly chosen. 


• d. How many are expected to attend their graduation?
• e. Find the probability that 17 or 18 attend.
• f. Based on numerical values, would you be surprised if all 22
attended graduation? Justify your answer numerically.


Exercise: 


Problem: 


At The Fencing Center, 60% of the fencers use the foil as their main 
weapon. We randomly survey 25 fencers at The Fencing Center. We 
are interested in the numbers that do not use the foil as their main 
weapon. 


• d. How many are expected to not use the foil as their main
weapon?
• e. Find the probability that six do not use the foil as their main
weapon.
• f. Based on numerical values, would you be surprised if all 25 did
not use foil as their main weapon? Justify your answer
numerically.


Solution:


• a. X = the number of fencers that do not use foil as their main
weapon
• b. 0, 1, 2, 3, ..., 25
• c. X ~ B(25, 0.40)
• d. 10
• e. 0.0442
• f. Yes


Exercise: 


Problem: 


Approximately 8% of students at a local high school participate in 
after-school sports all four years of high school. A group of 60 seniors 
is randomly chosen. Of interest is the number that participated in after- 
school sports all four years of high school. 


• d. How many seniors are expected to have participated in after-
school sports all four years of high school?
• e. Based on numerical values, would you be surprised if none of
the seniors participated in after-school sports all four years of
high school? Justify your answer numerically.
• f. Based upon numerical values, is it more likely that 4 or that 5 of
the seniors participated in after-school sports all four years of
high school? Justify your answer numerically.


Exercise: 


Problem: 


The chance of having an extra fortune in a fortune cookie is about 3%. 
Given a bag of 144 fortune cookies, we are interested in the number of 
cookies with an extra fortune. Two distributions may be used to solve 
this problem. Use one distribution to solve the problem. 


• d. How many cookies do we expect to have an extra fortune?
• e. Find the probability that none of the cookies have an extra
fortune.
• f. Find the probability that more than 3 have an extra fortune.
• g. As n increases, what happens involving the probabilities using
the two distributions? Explain in complete sentences.


Solution:


• a. X = the number of fortune cookies that have an extra fortune
• b. 0, 1, 2, 3, ..., 144
• c. X ~ B(144, 0.03) or P(4.32)
• d. 4.32
• e. 0.0124 or 0.0133
• f. 0.6300 or 0.6264
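
The paired answers in this solution come from solving the same problem with the binomial and with its Poisson approximation. The sketch below reproduces both; it assumes the standard binomial and Poisson formulas, and the helper names are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(n, p, x):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(mu, x):
    """P(X = x) for X ~ P(mu)."""
    return exp(-mu) * mu ** x / factorial(x)

n, p = 144, 0.03                          # 144 cookies, 3% chance of an extra fortune
mu = n * p                                # mean used for the Poisson approximation

binom_none = binomial_pmf(n, p, 0)
poisson_none = poisson_pmf(mu, 0)

# "More than 3" is the complement of "0, 1, 2, or 3".
binom_more_than_3 = 1 - sum(binomial_pmf(n, p, x) for x in range(4))
poisson_more_than_3 = 1 - sum(poisson_pmf(mu, x) for x in range(4))

print(round(mu, 2))                                              # 4.32
print(round(binom_none, 4), round(poisson_none, 4))              # 0.0124 0.0133
print(round(binom_more_than_3, 4), round(poisson_more_than_3, 4))  # 0.63 0.6264
```
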


Exercise: 


Problem: 


There are two games played for Chinese New Year and Vietnamese 
New Year. They are almost identical. In the Chinese version, fair dice 
with numbers 1, 2, 3, 4, 5, and 6 are used, along with a board with 
those numbers. In the Vietnamese version, fair dice with pictures of a 
gourd, fish, rooster, crab, crayfish, and deer are used. The board has 
those six objects on it, also. We will play with bets being $1. The 
player places a bet on a number or object. The “house” rolls three dice. 
If none of the dice show the number or object that was bet, the house 
keeps the $1 bet. If one of the dice shows the number or object bet 
(and the other two do not show it), the player gets back his $1 bet, plus 
$1 profit. If two of the dice show the number or object bet (and the 
third die does not show it), the player gets back his $1 bet, plus $2 
profit. If all three dice show the number or object bet, the player gets 
back his $1 bet, plus $3 profit. 


Let X = number of matches and Y= profit per game. 


• d. List the values that Y may take on. Then, construct one PDF
table that includes both X & Y and their probabilities.
• e. Calculate the average expected matches over the long run of
playing this game for the player.
• f. Calculate the average expected earnings over the long run of
playing this game for the player.
• g. Determine who has the advantage, the player or the house.


Exercise: 


Problem: 


According to the South Carolina Department of Mental Health web 
site, for every 200 U.S. women, the average number who suffer from 
anorexia is one (http://www.state.sc.us/dmh/anorexia/statistics.htm).
Out of a randomly chosen group of 600 U.S. women: 


• d. How many are expected to suffer from anorexia?
• e. Find the probability that no one suffers from anorexia.
• f. Find the probability that more than four suffer from anorexia.


Solution:


• a. X = the number of women that suffer from anorexia
• b. 0, 1, 2, 3, ..., 600 (can leave off 600)
• c. X ~ P(3)
• d. 3
• e. 0.0498
• f. 0.1847
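
The Poisson answers in this solution can be reproduced with a few lines of Python. This is a sketch assuming the standard Poisson formula P(X = x) = e^(-μ) μ^x / x!, with a hypothetical helper name.

```python
from math import exp, factorial

def poisson_pmf(mu, x):
    """P(X = x) for X ~ P(mu)."""
    return exp(-mu) * mu ** x / factorial(x)

women_sampled = 600
mu = women_sampled / 200                  # one per 200 women, so 3 expected in 600

p_none = poisson_pmf(mu, 0)
p_more_than_4 = 1 - sum(poisson_pmf(mu, x) for x in range(5))

print(mu)                                 # 3.0
print(round(p_none, 4))                   # 0.0498
print(round(p_more_than_4, 4))            # 0.1847
```
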


Exercise: 


Problem: 


The average number of children a Japanese woman has in her lifetime 
is 1.37. Suppose that one Japanese woman is randomly chosen. 
(http://www.mhlw.go.jp/english/policy/children/children-childrearing/index.html, MHLW’s Pamphlet)


• d. Find the probability that she has no children.
• e. Find the probability that she has fewer children than the
Japanese average.
• f. Find the probability that she has more children than the Japanese
average.


Exercise: 


Problem: 


The average number of children a Spanish woman has in her lifetime 
is 1.47. Suppose that one Spanish woman is randomly chosen. 
(http://www.typicallyspanish.com/news/publish/article 4897.shtml). 


• d. Find the probability that she has no children.
• e. Find the probability that she has fewer children than the Spanish
average.
• f. Find the probability that she has more children than the Spanish
average.


Solution:


• a. X = the number of children for a Spanish woman
• b. 0, 1, 2, 3, ...
• c. X ~ P(1.47)
• d. 0.2299
• e. 0.5679
• f. 0.4321


Exercise: 


Problem: 


Fertile (female) cats produce an average of 3 litters per year. (Source: 
The Humane Society of the United States). Suppose that one fertile, 
female cat is randomly chosen. In one year, find the probability she 
produces: 


• d. No litters.
• e. At least 2 litters.
• f. Exactly 3 litters.


Exercise: 


Problem: 


A consumer looking to buy a used red Miata car will call dealerships
until she finds a dealership that carries the car. She estimates that
the probability that any independent dealership will have the car is
28%. We are interested in the number of dealerships she must call.

• d. On average, how many dealerships would we expect her to have
to call until she finds one that has the car?
• e. Find the probability that she must call at most 4 dealerships.
• f. Find the probability that she must call 3 or 4 dealerships.


Solution: 


• a. X = the number of dealers she calls until she finds one with a
used red Miata
• b. 1, 2, 3, ...
• c. X~G(0.28)
• d. 3.57
• e. 0.7313
• f. 0.2497
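
The geometric answers above follow from two short formulas. This is a minimal Python sketch, assuming the model X~G(0.28) from part c, where X counts calls up to and including the first success; the function names are illustrative only.

```python
def geometric_pmf(k: int, p: float) -> float:
    """Return P(X = k): the first success occurs on trial k."""
    return (1 - p) ** (k - 1) * p


def geometric_cdf(k: int, p: float) -> float:
    """Return P(X <= k): at least one success in the first k trials."""
    return 1 - (1 - p) ** k


p = 0.28  # chance that any one dealership has the car

expected_calls = 1 / p                                     # part d
p_at_most_4 = geometric_cdf(4, p)                          # part e
p_3_or_4 = geometric_pmf(3, p) + geometric_pmf(4, p)       # part f

print(round(expected_calls, 2), round(p_at_most_4, 4), round(p_3_or_4, 4))
```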


Exercise: 


Problem: 


Suppose that the probability that an adult in America will watch the 
Super Bowl is 40%. Each person is considered independent. We are 
interested in the number of adults in America we must survey until we 
find one who will watch the Super Bowl. 


• d. How many adults in America do you expect to survey until you
find one who will watch the Super Bowl?
• e. Find the probability that you must ask 7 people.
• f. Find the probability that you must ask 3 or 4 people.


Exercise: 


Problem: 


A group of Martial Arts students is planning on participating in an 
upcoming demonstration. 6 are students of Tae Kwon Do; 7 are 
students of Shotokan Karate. Suppose that 8 students are randomly 
picked to be in the first demonstration. We are interested in the number 
of Shotokan Karate students in that first demonstration. Hint: Use the 
Hypergeometric distribution. Look in the Formulas section of 4: 
Discrete Distributions and in the Appendix Formulas. 


• d. How many Shotokan Karate students do we expect to be in that
first demonstration?
• e. Find the probability that 4 students of Shotokan Karate are
picked for the first demonstration.
• f. Suppose that we are interested in the Tae Kwon Do students that
are picked for the first demonstration. Find the probability that all
6 students of Tae Kwon Do are picked for the first demonstration.


Solution: 


• d. 4.31
• e. 0.4079
• f. 0.0163
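
These hypergeometric answers can be reproduced by counting combinations. The sketch below is one possible check in Python, assuming 13 students in total (7 Shotokan Karate, 6 Tae Kwon Do) and 8 picked without replacement, as in the problem; `hypergeom_pmf` is an illustrative helper name.

```python
from math import comb


def hypergeom_pmf(k: int, group_size: int, others: int, picked: int) -> float:
    """Return P(exactly k of the picked come from the group of interest).

    group_size: members of the group of interest
    others:     members not in that group
    picked:     sample size, drawn without replacement
    """
    total = group_size + others
    return comb(group_size, k) * comb(others, picked - k) / comb(total, picked)


karate, tkd, picked = 7, 6, 8

expected_karate = picked * karate / (karate + tkd)     # part d
p_four_karate = hypergeom_pmf(4, karate, tkd, picked)  # part e
p_all_six_tkd = hypergeom_pmf(6, tkd, karate, picked)  # part f

print(round(expected_karate, 2), round(p_four_karate, 4), round(p_all_six_tkd, 4))
```

For part f the roles of the two groups are swapped, so Tae Kwon Do becomes the group of interest.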


Exercise: 


Problem: 


The chance of an IRS audit for a tax return with over $25,000 in income
is about 2% per year. We are interested in the expected number of
audits a person with that income has in a 20 year period. Assume each
year is independent.

• d. How many audits are expected in a 20 year period?
• e. Find the probability that a person is not audited at all.
• f. Find the probability that a person is audited more than twice.


Exercise: 


Problem: 


Refer to the previous problem. Suppose that 100 people with tax 
returns over $25,000 are randomly picked. We are interested in the 
number of people audited in 1 year. One way to solve this problem is 
by using the Binomial Distribution. Since n is large and p is small, 
another discrete distribution could be used to solve the following 
problems. Solve the following questions (d-f) using that distribution. 


• d. How many are expected to be audited?
• e. Find the probability that no one was audited.
• f. Find the probability that more than 2 were audited.


Solution: 


• d. 2
• e. 0.1353
• f. 0.3233
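
The problem suggests replacing the Binomial with another discrete distribution when n is large and p is small. A minimal Python sketch follows, assuming that distribution is the Poisson with mean np = 2, and comparing it with the exact Binomial(100, 0.02); the function names are illustrative.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Return P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """Return P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


n, p = 100, 0.02
lam = n * p  # part d: expected number audited

p_none_poisson = poisson_pmf(0, lam)                                  # part e
p_more_than_2_poisson = 1 - sum(poisson_pmf(k, lam) for k in range(3))  # part f
p_more_than_2_binom = 1 - sum(binom_pmf(k, n, p) for k in range(3))

print(round(p_none_poisson, 4), round(p_more_than_2_poisson, 4))
print(round(p_more_than_2_binom, 4))
```

For part f the two models give nearly the same probability, which is the reason the approximation is acceptable here.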


Exercise: 


Problem: 


Suppose that a technology task force is being formed to study 
technology awareness among instructors. Assume that 10 people will 
be randomly chosen to be on the committee from a group of 28 
volunteers, 20 who are technically proficient and 8 who are not. We 
are interested in the number on the committee who are not technically 
proficient. 


• d. How many instructors do you expect on the committee who are
not technically proficient?
• e. Find the probability that at least 5 on the committee are not
technically proficient.
• f. Find the probability that at most 3 on the committee are not
technically proficient.


Exercise: 
Problem: 


Refer back to Exercise 4.15.12. Solve this problem again, using a 
different, though still acceptable, distribution. 


Solution: 


• a. X = the number of seniors that participated in after-school
sports all 4 years of high school
• b. 0, 1, 2, 3, ..., 60
• c. X~P(4.8)
• d. 4.8
• e. Yes
• f. 4


Exercise: 


Problem: 


Suppose that 9 Massachusetts athletes are scheduled to appear at a 
charity benefit. The 9 are randomly chosen from 8 volunteers from the 
Boston Celtics and 4 volunteers from the New England Patriots. We 
are interested in the number of Patriots picked. 


• d. Is it more likely that there will be 2 Patriots or 3 Patriots picked?


Exercise: 


Problem: 


On average, Pierre, an amateur chef, drops 3 pieces of egg shell into 
every 2 batters of cake he makes. Suppose that you buy one of his 
cakes. 


• d. On average, how many pieces of egg shell do you expect to be
in the cake?
• e. What is the probability that there will not be any pieces of egg
shell in the cake?
• f. Let's say that you buy one of Pierre's cakes each week for 6
weeks. What is the probability that there will not be any egg shell
in any of the cakes?
• g. Based upon the average given for Pierre, is it possible for there
to be 7 pieces of shell in the cake? Why?


Solution: 


• a. X = the number of shell pieces in one cake
• b. 0, 1, 2, 3, ...
• c. X~P(1.5)
• d. 1.5
• e. 0.2231
• f. 0.0001
• g. Yes


Exercise: 


Problem: 


It has been estimated that only about 30% of California residents have 
adequate earthquake supplies. Suppose we are interested in the number 
of California residents we must survey until we find a resident who 
does not have adequate earthquake supplies. 


• d. What is the probability that we must survey just 1 or 2 residents
until we find a California resident who does not have adequate
earthquake supplies?
• e. What is the probability that we must survey at least 3 California
residents until we find a California resident who does not have
adequate earthquake supplies?
• f. How many California residents do you expect to need to survey
until you find a California resident who does not have adequate
earthquake supplies?
• g. How many California residents do you expect to need to survey
until you find a California resident who does have adequate
earthquake supplies?


Exercise: 


Problem: 


Refer to the above problem. Suppose you randomly survey 11 
California residents. We are interested in the number who have 
adequate earthquake supplies. 


• d. What is the probability that at least 8 have adequate earthquake
supplies?
• e. Is it more likely that none or that all of the residents surveyed
will have adequate earthquake supplies? Why?
• f. How many residents do you expect will have adequate
earthquake supplies?


Solution: 


• d. 0.0043
• e. none
• f. 3.3


The next 2 questions refer to the following: In one of its Spring catalogs, 
L.L. Bean® advertised footwear on 29 of its 192 catalog pages. 
Exercise: 


Problem: 


Suppose we randomly survey 20 pages. We are interested in the 
number of pages that advertise footwear. Each page may be picked at 
most once. 


• d. How many pages do you expect to advertise footwear on them?
• e. Is it probable that all 20 will advertise footwear on them? Why
or why not?
• f. What is the probability that less than 10 will advertise footwear
on them?


Exercise: 


Problem: 


Suppose we randomly survey 20 pages. We are interested in the 
number of pages that advertise footwear. This time, each page may be 
picked more than once. 


• d. How many pages do you expect to advertise footwear on them?
• e. Is it probable that all 20 will advertise footwear on them? Why
or why not?
• f. What is the probability that less than 10 will advertise footwear
on them?
• g. Reminder: A page may be picked more than once. We are
interested in the number of pages that we must randomly survey
until we find one that has footwear advertised on it. Define the
random variable X and give its distribution.
• h. What is the probability that you only need to survey at most 3
pages in order to find one that advertises footwear on it?
• i. How many pages do you expect to need to survey in order to find
one that advertises footwear?


Solution: 


• d. 3.02
• e. No
• f. 0.9997
• h. 0.3881
• i. 6.6207 pages


Exercise: 


Problem: 


Suppose that you roll a fair die until each face has appeared at least 
once. It does not matter in what order the numbers appear. Find the 
expected number of rolls you must make until each face has appeared 
at least once. 


Try these multiple choice problems. 


For the next three problems: The probability that the San Jose Sharks will 
win any given game is 0.3694 based on a 13 year win history of 382 wins 
out of 1034 games played (as of a certain date). An upcoming monthly 
schedule contains 12 games. 

Let X = the number of games won in that upcoming month. 

Exercise: 


Problem: The expected number of wins for that upcoming month is: 


• A. 1.67
• B. 12


• C. 382/1034


• D. 4.43


Solution: 


D: 4.43 
Exercise: 


Problem: 


What is the probability that the San Jose Sharks win 6 games in that 
upcoming month? 


• A. 0.1476
• B. 0.2336
• C. 0.7664
• D. 0.8903


Solution: 


A: 0.1476 


Exercise: 


Problem: 


What is the probability that the San Jose Sharks win at least 5 games in
that upcoming month?

• A. 0.3694
• B. 0.5266
• C. 0.4734
• D. 0.2305


Solution: 


C: 0.4734 


For the next two questions: The average number of times per week that 
Mrs. Plum’s cats wake her up at night because they want to play is 10. We 
are interested in the number of times her cats wake her up each week. 
Exercise: 


Problem: In words, the random variable X = 


• A. The number of times Mrs. Plum's cats wake her up each week
• B. The number of times Mrs. Plum's cats wake her up each hour
• C. The number of times Mrs. Plum's cats wake her up each night
• D. The number of times Mrs. Plum's cats wake her up


Solution: 


A: The number of times Mrs. Plum's cats wake her up each week 
Exercise: 
Problem: 


Find the probability that her cats will wake her up no more than 5 
times next week. 


• A. 0.5000
• B. 0.9329
• C. 0.0378
• D. 0.0671


Solution: 


D: 0.0671 
Exercise: 


Problem: 


People visiting video rental stores often rent more than one DVD at a
time. The probability distribution for DVD rentals per customer at
Video To Go is given below. There is a 5-video limit per customer at
this store, so nobody ever rents more than 5 DVDs.

x         0      1      2      3      4      5
P(X=x)    0.03   0.50   0.24   ?      0.07   0.04

• A. Describe the random variable X in words.
• B. Find the probability that a customer rents three DVDs.
• C. Find the probability that a customer rents at least 4 DVDs.
• D. Find the probability that a customer rents at most 2 DVDs.

Another shop, Entertainment Headquarters, rents DVDs and
videogames. The probability distribution for DVD rentals per customer
at this shop is given below. They also have a 5-DVD limit per
customer.

x         0      1      2      3      4      5
P(X=x)    0.35   0.25   0.20   0.10   0.05   0.05

• E. At which store is the expected number of DVDs rented per
customer higher?
• F. If Video to Go estimates that they will have 300 customers next
week, how many DVDs do they expect to rent next week?
Answer in sentence form.
• G. If Video to Go expects 300 customers next week and
Entertainment HQ projects that they will have 420 customers, for
which store is the expected number of DVD rentals for next week
higher? Explain.
• H. Which of the two video stores experiences more variation in
the number of DVD rentals per customer? How do you know
that?


Solution: 


Partial Answer: 

A: X = the number of DVDs a Video to Go customer rents 
B: 0.12 

C: 0.11 

D: 0.77 
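
For parts E through H, the mean and standard deviation of each distribution are the quantities to compare. The sketch below shows one way to compute them in Python; it assumes the two probability tables from the problem, with the missing Video To Go entry taken as 0.12 from answer B, and the name `mean_and_sd` is illustrative.

```python
import math


def mean_and_sd(dist: dict[int, float]) -> tuple[float, float]:
    """Return (expected value, standard deviation) of a discrete distribution."""
    mu = sum(x * p for x, p in dist.items())
    variance = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, math.sqrt(variance)


# P(X = 3) = 0.12 for Video To Go comes from answer B above.
video_to_go = {0: 0.03, 1: 0.50, 2: 0.24, 3: 0.12, 4: 0.07, 5: 0.04}
entertainment_hq = {0: 0.35, 1: 0.25, 2: 0.20, 3: 0.10, 4: 0.05, 5: 0.05}

mu_vtg, sd_vtg = mean_and_sd(video_to_go)
mu_ehq, sd_ehq = mean_and_sd(entertainment_hq)

print("Video To Go:", round(mu_vtg, 2), round(sd_vtg, 4))
print("Entertainment HQ:", round(mu_ehq, 2), round(sd_ehq, 4))
# Expected weekly rentals scale with the customer count (parts F and G).
print(300 * mu_vtg, 420 * mu_ehq)
```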


Exercise: 
Problem: 
A game involves selecting a card from a deck of cards and tossing a
coin. The deck has 52 cards and 12 cards are "face cards" (Jack,
Queen, or King). The coin is a fair coin and is equally likely to land on
Heads or Tails.

• If the card is a face card and the coin lands on Heads, you win $6.
• If the card is a face card and the coin lands on Tails, you win $2.
• If the card is not a face card, you lose $2, no matter what the coin
shows.

• A. Find the expected value for this game (expected net gain or
loss).
• B. Explain what your calculations indicate about your long-term
average profits and losses on this game.
• C. Should you play this game to win money?


Solution: 


The variable of interest is X = net gain or loss, in dollars 


The face cards are J, Q, K (Jack, Queen, King). There are (3)(4) = 12 face
cards and 52 - 12 = 40 cards that are not face cards.

We first need to construct the probability distribution for X. We use the
card and coin events to determine the probability for each outcome,
but we use the monetary value of X to determine the expected value.

Card Event                       $X net gain or loss    P(X)
Face Card and Heads              6                      (12/52)(1/2) = 6/52
Face Card and Tails              2                      (12/52)(1/2) = 6/52
(Not Face Card) and (H or T)     -2                     (40/52)(1) = 40/52

• Expected value = (6)(6/52) + (2)(6/52) + (-2)(40/52) = -32/52
• Expected value = -$0.62, rounded to the nearest cent
• If you play this game repeatedly, over a long number of games,
you would expect to lose 62 cents per game, on average.
• You should not play this game to win money because the
expected value indicates an expected average loss.
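
The expected value above is a probability-weighted sum, which can be checked exactly with rational arithmetic. This is a minimal Python sketch, assuming the three outcomes and probabilities from the solution; the variable names are illustrative.

```python
from fractions import Fraction

# (net gain in dollars, probability) for each outcome of the game
outcomes = [
    (6, Fraction(12, 52) * Fraction(1, 2)),   # face card and Heads
    (2, Fraction(12, 52) * Fraction(1, 2)),   # face card and Tails
    (-2, Fraction(40, 52)),                   # not a face card, any coin result
]

# A valid probability distribution must sum to 1.
total_probability = sum(p for _, p in outcomes)

expected_value = sum(x * p for x, p in outcomes)

print(total_probability, expected_value, round(float(expected_value), 2))
```

`Fraction` reduces -32/52 to lowest terms automatically, so the printed value is the same number in a simpler form.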


Exercise: 
Problem: 
You buy a lottery ticket to a lottery that costs $10 per ticket. There are
only 100 tickets available to be sold in this lottery. In this lottery there is
one $500 prize, 2 $100 prizes and 4 $25 prizes. Find your expected
gain or loss.


Solution: 


Start by writing the probability distribution. X is net gain or loss = 
prize (if any) less $10 cost of ticket 


X = $ net gain or loss      P(X)
$500 - $10 = $490           1/100
$100 - $10 = $90            2/100
$25 - $10 = $15             4/100
$0 - $10 = -$10             93/100

Expected Value = (490)(1/100) + (90)(2/100) + (15)(4/100) + (-10)(93/100)
= -$2. There is an expected loss of $2 per ticket, on average.


Exercise: 


Problem: 


A student takes a 10 question true-false quiz, but did not study and 
randomly guesses each answer. Find the probability that the student 
passes the quiz with a grade of at least 70% of the questions correct. 


Solution: 


• X = number of questions answered correctly
• X~B(10, 0.5)
• We are interested in AT LEAST 70% of 10 questions correct.
70% of 10 is 7. We want to find the probability that X is greater
than or equal to 7. The event "at least 7" is the complement of
"less than or equal to 6".
• Using your calculator's distribution menu: 1 - binomcdf(10, .5, 6)
gives 0.171875
• The probability of getting at least 70% of the 10 questions correct
when randomly guessing is approximately 0.172


Exercise: 


Problem: 


A student takes a 32 question multiple choice exam, but did not study 
and randomly guesses each answer. Each question has 3 possible 
choices for the answer. Find the probability that the student guesses 
more than 75% of the questions correctly. 


Solution: 


• X = number of questions answered correctly
• X~B(32, 1/3)
• We are interested in MORE THAN 75% of 32 questions correct.
75% of 32 is 24. We want to find P(x>24). The event "more than
24" is the complement of "less than or equal to 24".
• Using your calculator's distribution menu: 1 - binomcdf(32, 1/3,
24)
• P(x>24) = 0.00000026761
• The probability of getting more than 75% of the 32 questions
correct when randomly guessing is very small and practically
zero.
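
The calculator step 1 - binomcdf(n, p, k) used in the two quiz problems above can be mirrored in code. The following Python sketch is one possible equivalent, assuming binomcdf(n, p, k) means P(X <= k) for X~B(n, p); `binom_cdf` is an illustrative name, not a calculator or library function.

```python
from math import comb


def binom_cdf(n: int, p: float, k: int) -> float:
    """Return P(X <= k) for X ~ B(n, p), by summing the individual probabilities."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# True-false quiz: at least 7 of 10 correct is the complement of at most 6.
p_pass_true_false = 1 - binom_cdf(10, 0.5, 6)

# Multiple choice exam: more than 24 of 32 correct is the complement of at most 24.
p_pass_multiple_choice = 1 - binom_cdf(32, 1 / 3, 24)

print(p_pass_true_false)
print(p_pass_multiple_choice)
```

In both cases the complement is taken at the largest value that is excluded from the event of interest (6 for "at least 7", 24 for "more than 24").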


Exercise: 


Problem: 


Suppose that you are performing the probability experiment of rolling
one fair six-sided die. Let F be the event of rolling a "4" or a "5". You
are interested in how many times you need to roll the die in order to
obtain the first "4 or 5" as the outcome.

• p = probability of success (event F occurs)
• q = probability of failure (event F does not occur)

• A. Write the description of the random variable X. What are the
values that X can take on? Find the values of p and q.
• B. Find the probability that the first occurrence of event F (rolling
a "4" or "5") is on the second trial.
• C. How many trials would you expect until you roll a "4" or "5"?


Solution: 


A: X can take on the values 1, 2, 3, .... p = 2/6, q = 4/6
B: 0.2222
C: 3


**Exercises 38 - 43 contributed by Roberta Bloom 


Review 
This module provides a number of homework/review exercises 
summarizing topics related to Discrete Random Variables. 


The next two questions refer to the following: 


A recent poll concerning credit cards found that 35 percent of respondents 
use a credit card that gives them a mile of air travel for every dollar they 
charge. Thirty percent of the respondents charge more than $2000 per 
month. Of those respondents who charge more than $2000, 80 percent use a 
credit card that gives them a mile of air travel for every dollar they charge. 
Exercise: 


Problem: 


What is the probability that a randomly selected respondent will spend 
more than $2000 AND use a credit card that gives them a mile of air 
travel for every dollar they charge? 


• A. (0.30)(0.35)
• B. (0.80)(0.35)
• C. (0.80)(0.30)
• D. (0.80)


Solution: 


C 
Exercise: 
Problem: 
Based upon the above information, are using a credit card that gives a
mile of air travel for each dollar spent AND charging more than $2000
per month independent events?

• A. Yes
• B. No, and they are not mutually exclusive either
• C. No, but they are mutually exclusive
• D. Not enough information given to determine the answer


Solution: 


B 
Exercise: 


Problem: 


A sociologist wants to know the opinions of employed adult women 
about government funding for day care. She obtains a list of 520 
members of a local business and professional women’s club and mails 
a questionnaire to 100 of these women selected at random. 68 
questionnaires are returned. What is the population in this study? 


• A. All employed adult women
• B. All the members of a local business and professional women's
club
• C. The 100 women who received the questionnaire
• D. All employed women with children


Solution: 


A 


The next two questions refer to the following: An article from The San Jose 
Mercury News was concerned with the racial mix of the 1500 students at 
Prospect High School in Saratoga, CA. The table summarizes the results. 
(Male and female values are approximate.) Suppose one Prospect High 
School student is randomly selected. 


Ethnic Group

Gender    White   Asian   Hispanic   Black   American Indian
Male      400     168     115        35      16
Female    440     132     140        40      14

Exercise:


Problem: Find the probability that a student is Asian or Male. 


Solution: 


0.5773 
Exercise: 


Problem: 


Find the probability that a student is Black given that the student is
Female.


Solution: 


0.0522 
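
Both contingency-table answers come from counting cells. A minimal Python sketch follows, assuming the counts in the Prospect High School table above (1500 students in total); the dictionary layout and variable names are illustrative.

```python
counts = {
    "Male": {"White": 400, "Asian": 168, "Hispanic": 115, "Black": 35, "American Indian": 16},
    "Female": {"White": 440, "Asian": 132, "Hispanic": 140, "Black": 40, "American Indian": 14},
}

total = sum(sum(row.values()) for row in counts.values())
male_total = sum(counts["Male"].values())
female_total = sum(counts["Female"].values())
asian_total = counts["Male"]["Asian"] + counts["Female"]["Asian"]

# P(Asian OR Male) = P(Asian) + P(Male) - P(Asian AND Male)
p_asian_or_male = (asian_total + male_total - counts["Male"]["Asian"]) / total

# P(Black GIVEN Female) restricts the sample space to the Female row.
p_black_given_female = counts["Female"]["Black"] / female_total

print(total, round(p_asian_or_male, 4), round(p_black_given_female, 4))
```

Subtracting the Asian male count avoids counting those students twice in the "or" probability.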
Exercise: 


Problem: 


A sample of pounds lost, in a certain month, by individual members of
a weight reducing clinic produced the following statistics:

• Mean = 5 lbs.
• Median = 4.5 lbs.
• Mode = 4 lbs.
• Standard deviation = 3.8 lbs.
• First quartile = 2 lbs.
• Third quartile = 8.5 lbs.

The correct statement is:

• A. One fourth of the members lost exactly 2 pounds.
• B. The middle fifty percent of the members lost from 2 to 8.5 lbs.
• C. Most people lost 3.5 to 4.5 lbs.
• D. All of the choices above are correct.


Solution: 


B 
Exercise: 
Problem: 


What does it mean when a data set has a standard deviation equal to
zero?

• A. All values of the data appear with the same frequency.
• B. The mean of the data is also zero.
• C. All of the data have the same value.
• D. There are no data to begin with.


Solution: 


C 


Exercise: 


Problem: The statement that best describes the illustration below is: 


• A. The mean is equal to the median.
• B. There is no first quartile.
• C. The lowest data value is the median.


• D. The median equals (Q1 + Q3)/2.


Solution: 


C 

Exercise: 
Problem: 
According to a recent article (San Jose Mercury News) the average
number of babies born with significant hearing loss (deafness) is
approximately 2 per 1000 babies in a healthy baby nursery. The
number climbs to an average of 30 per 1000 babies in an intensive care
nursery.

Suppose that 1000 babies from healthy baby nurseries were randomly
surveyed. Find the probability that exactly 2 babies were born deaf.


Solution: 


0.2709 
Exercise: 
Problem: 
A "friend" offers you the following "deal." For a $10 fee, you may
pick an envelope from a box containing 100 seemingly identical
envelopes. However, each envelope contains a coupon for a free gift.

• 10 of the coupons are for a free gift worth $6.
• 80 of the coupons are for a free gift worth $8.
• 6 of the coupons are for a free gift worth $12.
• 4 of the coupons are for a free gift worth $40.

Based upon the financial gain or loss over the long run, should you
play the game?

• A. Yes, I expect to come out ahead in money.
• B. No, I expect to come out behind in money.
• C. It doesn't matter. I expect to break even.


Solution: 


B 


The next four questions refer to the following: Recently, a nurse 
commented that when a patient calls the medical advice line claiming to 
have the flu, the chance that he/she truly has the flu (and not just a nasty 
cold) is only about 4%. Of the next 25 patients calling in claiming to have 
the flu, we are interested in how many actually have the flu. 

Exercise: 


Problem: Define the Random Variable and list its possible values. 


Solution: 


X = the number of patients calling in claiming to have the flu, who 
actually have the flu. X = 0, 1, 2, ...25 


Exercise: 


Problem: State the distribution of X . 


Solution: 


B(25,0.04) 
Exercise: 


Problem: 


Find the probability that at least 4 of the 25 patients actually have the 
flu. 


Solution: 


0.0165 
Exercise: 
Problem: 


On average, for every 25 patients calling in, how many do you expect 
to have the flu? 


Solution: 


1


The next two questions refer to the following: Different types of writing can 
sometimes be distinguished by the number of letters in the words used. A 
student interested in this fact wants to study the number of letters of words 
used by Tom Clancy in his novels. She opens a Clancy novel at random and 
records the number of letters of the first 250 words on the page. 

Exercise: 


Problem: What kind of data was collected?

• A. qualitative
• B. quantitative - continuous
• C. quantitative - discrete


Solution: 


C 


Exercise: 


Problem: What is the population under study? 
Solution: 


All words used by Tom Clancy in his novels 


Discrete Random Variables: Lab I (edited: Teegarden) 

This module allows students to explore concepts related to discrete random
variables through the use of a simple playing card experiment. Students will
compare empirical data to a theoretical distribution to determine if the
experiment fits a discrete distribution. This lab involves the concept of
long-term probabilities. Labs changed to incorporate Minitab.


Discrete Probability Lab 


Name: 


Student Learning Outcomes: 


• The student will compare empirical data and a theoretical distribution
to determine if an everyday experiment fits a discrete distribution.
• The student will demonstrate an understanding of long-term
probabilities.


Procedure: The experiment procedure is to pick one card from a deck of 
shuffled cards. 


1. The theoretical probability of picking a diamond from a deck is: 

2. Shuffle a deck of cards and pick one card from it and record whether it 
was a diamond or not a diamond. 

3. Put the card back and reshuffle. 

4. Do this a total of 10 times and record the number of diamonds picked. 

5. What is the experimental probability of drawing a diamond? 

6. How does the experimental probability compare to the theoretical 
probability? (high/low/about the same) 


Using Minitab, simulate this experiment (drawing a card 10 times and
recording the number of diamonds) for a total of 50 times. Use Calc ->
Random data -> Binomial.
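
For readers without Minitab, the same simulation can be sketched in Python. This is only an illustration under stated assumptions: each draw is a diamond with probability 13/52 = 0.25, draws are independent because the card is replaced, and the function name and `seed` parameter are hypothetical conveniences, not part of the lab.

```python
import random
from collections import Counter


def simulate_diamond_counts(trials: int = 50, draws: int = 10,
                            p: float = 0.25, seed: int = 0) -> list[int]:
    """Simulate `trials` repetitions of drawing `draws` cards with replacement.

    Each entry is the number of diamonds seen in one repetition.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < p for _ in range(draws)) for _ in range(trials)]


counts = simulate_diamond_counts()
frequency = Counter(counts)

# Frequency and relative frequency for each possible value of X (0 through 10).
for x in range(11):
    print(x, frequency[x], frequency[x] / len(counts))
```

Changing `trials` to 500 gives a quick way to explore the last discussion question of the lab.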


I. Organize the Data:

1. Summarize the data generated in Minitab and determine both
the frequency and relative frequency. Record the result here:

X      Frequency      Relative Frequency
0
1
2
...
10


2. Calculate the following using Minitab. (include the session window) 


x̄ = ______    s = ______


3. Construct a bar chart of the experimental data using the relative 
frequency as the vertical axis and attach it to this cover sheet. Don’t forget a 
title and labels for the graph 


II. Theoretical Distribution 


1. Using Minitab, build the theoretical PDF chart for X based on the 
distribution in the section above. 


X      P(X)
0
1
2
...
10


2. Calculate the following, indicating the formulas: 
————————————— 

3. Construct a graph of the theoretical distribution by using:

Graph -> Probability Distribution Plot -> Single View -> Binomial


Attach the graph to this cover sheet. 


III. Using the Data 


Using the Theoretical probability table generated by Minitab, determine the 
following theoretical probabilities, rounding to 4 decimal places: 


P(X = 3) =        P(2 < X < 5) =        P(X > 8) =


Using the data from the Minitab simulation, determine the following 
empirical (experimental) probabilities: 


P(X = 3) =        P(2 < X < 5) =        P(X > 8) =


IV. Discussion Questions: 


Answer the following in complete sentences on a separate sheet of 
paper and attach it to this cover sheet. 


1. Knowing that data vary, describe two similarities between the graphs 
and distributions of the theoretical and experimental distributions. 

2. Describe the two most significant differences between the graphs or 
distributions of the theoretical and experimental distributions. 

3. Suppose that the experiment had been repeated 500 times. Would you
expect the frequency table and bar chart in part I above to change?
How and why? Repeat the experiment and justify your answer. (Be
sure to include the data summary and bar chart.)


Continuous Random Variables 

Continuous Random Variables: Introduction is part of the collection 
col10555 written by Barbara Illowsky and Susan Dean and serves as an 
introduction to the uniform and exponential distributions with contributions 
from Roberta Bloom. 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Recognize and understand continuous probability density functions in
general.
• Recognize the uniform probability distribution and apply it
appropriately.
• Recognize the exponential probability distribution and apply it
appropriately.


Introduction 


Continuous random variables have many applications. Baseball batting 
averages, IQ scores, the length of time a long distance telephone call lasts, 
the amount of money a person carries, the length of time a computer chip 
lasts, and SAT scores are just a few. The field of reliability depends on a 
variety of continuous random variables. 


This chapter gives an introduction to continuous random variables and the 
many continuous distributions. We will be studying these continuous 
distributions for several chapters. 


Note: The values of discrete and continuous random variables can be
ambiguous. For example, if X is equal to the number of miles (to the
nearest mile) you drive to work, then X is a discrete random variable. You
count the miles. If X is the distance you drive to work, then you measure
values of X and X is a continuous random variable. How the random
variable is defined is very important.


Properties of Continuous Probability Distributions 


The graph of a continuous probability distribution is a curve. Probability is 
represented by area under the curve. 


The curve is called the probability density function (abbreviated: pdf).
We use the symbol f(x) to represent the curve. f(x) is the function that
corresponds to the graph; we use the density function f(x) to draw the
graph of the probability distribution.


Area under the curve is given by a different function called the 
cumulative distribution function (abbreviated: cdf). The cumulative 
distribution function is used to evaluate probability as area. 


• The outcomes are measured, not counted.
• The entire area under the curve and above the x-axis is equal to 1.
• Probability is found for intervals of x values rather than for individual
x values.
• P(c < x < d) is the probability that the random variable X is in the
interval between the values c and d. P(c < x < d) is the area under
the curve, above the x-axis, to the right of c and the left of d.
• P(x = c) = 0. The probability that x takes on any single individual
value is 0. The area below the curve, above the x-axis, and between
x=c and x=c has no width, and therefore no area (area = 0). Since the
probability is equal to the area, the probability is also 0.


We will find the area that represents probability by using geometry, 
formulas, technology, or probability tables. In general, calculus is needed to 
find the area under the curve for many probability density functions. When 
we use formulas to find the area in this textbook, the formulas were found 
by using the techniques of integral calculus. However, because most 
students taking this course have not studied calculus, we will not be using 
calculus in this textbook. 


There are many continuous probability distributions. When using a 
continuous probability distribution to model probability, the distribution 
used is selected to best model and fit the particular situation. 


In this chapter and the next chapter, we will study the uniform distribution, 
the exponential distribution, and the normal distribution. The following 
graphs illustrate these distributions. 


[Figure: The Uniform Distribution. The horizontal axis runs from x = 0 to
x = 10; the shaded area represents P(3 < x < 6).]

The graph shows a Uniform Distribution with the area between x=3 and
x=6 shaded to represent the probability that the value of the random
variable X is in the interval between 3 and 6.


[Figure: The Exponential Distribution. The shaded area represents
P(2 < x < 4).]

The graph shows an Exponential Distribution with the area between x=2
and x=4 shaded to represent the probability that the value of the
random variable X is in the interval between 2 and 4.


[Figure: The Normal Distribution. The shaded area represents the
probability P(1 < x < 2).]

The graph shows the Standard Normal Distribution with the area between
x=1 and x=2 shaded to represent the probability that the value of the
random variable X is in the interval between 1 and 2.


** With contributions from Roberta Bloom 


Glossary 


Uniform Distribution
A continuous random variable (RV) that has equally likely outcomes
over the domain, a < x < b. Often referred to as the Rectangular
distribution because the graph of the pdf has the form of a rectangle.
Notation: X~U(a,b). The mean is $\mu = \frac{a+b}{2}$ and the standard
deviation is $\sigma = \sqrt{\frac{(b-a)^2}{12}}$. The probability density
function is $f(x) = \frac{1}{b-a}$ for $a < x < b$ or $a \le x \le b$. The
cumulative distribution is $P(X \le x) = \frac{x-a}{b-a}$.

Exponential Distribution
A continuous random variable (RV) that appears when we are
interested in the intervals of time between some random events, for
example, the length of time between emergency arrivals at a hospital.
Notation: X~Exp(m). The mean is $\mu = \frac{1}{m}$ and the standard
deviation is $\sigma = \frac{1}{m}$. The probability density function is
$f(x) = m e^{-mx}$, $x \ge 0$, and the cumulative distribution function is
$P(X \le x) = 1 - e^{-mx}$.


Continuous Probability Functions 

This module introduces the continuous probability function and explores 
the relationship between the probability of X and the area under the curve 
of f(X). 


We begin by defining a continuous probability density function. We use the
function notation f(x). Intermediate algebra may have been your first
formal introduction to functions. In the study of probability, the functions
we study are special. We define the function f(x) so that the area between
it and the x-axis is equal to a probability. Since the maximum probability is
one, the maximum area is also one.


For continuous probability distributions, PROBABILITY = AREA. 


Example:
Consider the function $f(x) = \frac{1}{20}$ for $0 \le x \le 20$, where x is a
real number. The graph of $f(x) = \frac{1}{20}$ is a horizontal line. However,
since $0 \le x \le 20$, f(x) is restricted to the portion between x = 0 and
x = 20, inclusive.

[Figure: $f(x) = \frac{1}{20}$ for $0 \le x \le 20$. The graph of
$f(x) = \frac{1}{20}$ is a horizontal line segment when $0 \le x \le 20$.]

The area between $f(x) = \frac{1}{20}$ where $0 \le x \le 20$ and the x-axis is
the area of a rectangle with base = 20 and height = $\frac{1}{20}$.

AREA = $20 \cdot \frac{1}{20} = 1$

This particular function, where we have restricted x so that the area 
between the function and the x-axis is 1, is an example of a continuous 
probability density function. It is used as a tool to calculate probabilities. 
Suppose we want to find the area between $f(x) = \frac{1}{20}$ and the x-axis
where $0 < x < 2$.

$\text{AREA} = (2 - 0) \cdot \frac{1}{20} = 0.1$

$(2 - 0) = 2$ = base of a rectangle

$\frac{1}{20}$ = the height.

The area corresponds to a probability. The probability that x is between 0
and 2 is 0.1, which can be written mathematically as

$P(0 < x < 2) = P(x < 2) = 0.1$.

Suppose we want to find the area between $f(x) = \frac{1}{20}$ and the x-axis
where $4 < x < 15$.

$\text{AREA} = (15 - 4) \cdot \frac{1}{20} = 0.55$

$(15 - 4) = 11$ = the base of a rectangle

$\frac{1}{20}$ = the height.

The area corresponds to the probability $P(4 < x < 15) = 0.55$.

Suppose we want to find $P(x = 15)$. On an x-y graph, x = 15 is a vertical
line. A vertical line has no width (or 0 width). Therefore,
$P(x = 15) = (\text{base})(\text{height}) = (0)\left(\frac{1}{20}\right) = 0$.


$P(X \le x)$ (can be written as $P(X < x)$ for continuous distributions) is
called the cumulative distribution function or CDF. Notice the "less than
or equal to" symbol. We can use the CDF to calculate $P(X > x)$. The
CDF gives "area to the left" and $P(X > x)$ gives "area to the right." We
calculate $P(X > x)$ for continuous distributions as follows:

$P(X > x) = 1 - P(X < x)$.

[Figure: graph of f(x) with the area to the left of x labeled $P(X < x)$ and the area to the right labeled $P(X > x) = 1 - P(X < x)$.]


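Label the graph with f(x) and x. Scale the x and y axes with the maximum
x and y values. $f(x) = \frac{1}{20}$, $0 \le x \le 20$.

[Figure: graph of f(x) with x = 2.3 and x = 12.7 marked on the x-axis.]

$P(2.3 < x < 12.7) = (\text{base})(\text{height}) = (12.7 - 2.3)\left(\frac{1}{20}\right) = 0.52$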
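The rectangle calculations in this module can be checked numerically. The following is a minimal Python sketch, assuming the density $f(x) = \frac{1}{20}$ on $0 \le x \le 20$ from the example; the function name `uniform_area` is a hypothetical helper chosen for illustration and is not part of the original text.

```python
def uniform_area(c, d, a=0.0, b=20.0):
    """Area under f(x) = 1/(b - a) between c and d, clipped to [a, b]."""
    lo = max(c, a)
    hi = min(d, b)
    if hi <= lo:
        return 0.0  # no width means no area, as with P(x = 15)
    height = 1.0 / (b - a)
    return (hi - lo) * height  # PROBABILITY = AREA = (base)(height)


print(round(uniform_area(0, 2), 4))       # P(0 < x < 2)
print(round(uniform_area(4, 15), 4))      # P(4 < x < 15)
print(round(uniform_area(15, 15), 4))     # P(x = 15)
print(round(uniform_area(2.3, 12.7), 4))  # P(2.3 < x < 12.7)
```

The four printed values should agree with the areas worked out by hand: 0.1, 0.55, 0, and 0.52.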


The Uniform Distribution 

Continuous Random Variable: Uniform Distribution is part of the collection col10555 written by Barbara 
Illowsky and Susan Dean. It describes the properties of the Uniform Distribution with contributions from 
Roberta Bloom. 


Example: 

The previous problem is an example of the uniform probability distribution. 

Illustrate the uniform distribution. The data that follow are 55 smiling times, in seconds, of an
eight-week old baby.


10.4 19.6 18.8 13.9 17.8 16.8 21.6 17.9 12.5 11.1 4.9
12.8 14.8 22.8 20.0 15.9 16.3 13.4 17.1 14.5 19.0 22.8
1.3 0.7 8.9 11.9 10.9 7.3 5.9 3.7 17.9 19.2 9.8
5.8 6.9 2.6 5.8 21.7 11.8 3.4 2.1 4.5 6.3 10.7
8.9 9.4 9.4 7.6 10.0 3.3 6.7 7.8 11.6 13.8 18.6


sample mean = 11.49 and sample standard deviation = 6.23 

We will assume that the smiling times, in seconds, follow a uniform distribution between 0 and 23 
seconds, inclusive. This means that any smiling time from 0 to and including 23 seconds is equally likely. 
The histogram that could be constructed from the sample is an empirical distribution that closely matches 
the theoretical uniform distribution. 

Let X = length, in seconds, of an eight-week old baby's smile. 

The notation for the uniform distribution is 

$X \sim U(a,b)$ where a = the lowest value of x and b = the highest value of x.

The probability density function is $f(x) = \frac{1}{b-a}$ for $a \le x \le b$.

For this example, $X \sim U(0,23)$ and $f(x) = \frac{1}{23-0}$ for $0 \le x \le 23$.


Formulas for the theoretical mean and standard deviation are

$\mu = \frac{a+b}{2}$ and $\sigma = \sqrt{\frac{(b-a)^2}{12}}$

For this problem, the theoretical mean and standard deviation are

$\mu = \frac{0+23}{2} = 11.50$ seconds and $\sigma = \sqrt{\frac{(23-0)^2}{12}} = 6.64$ seconds

Notice that the theoretical mean and standard deviation are close to the sample mean and standard
deviation.
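As a quick check of the two formulas, here is a small Python sketch that evaluates them for $X \sim U(0, 23)$. The helper name `uniform_mean_sd` is an assumption made for this illustration only.

```python
import math


def uniform_mean_sd(a, b):
    """Theoretical mean and standard deviation of X ~ U(a, b)."""
    mean = (a + b) / 2
    sd = math.sqrt((b - a) ** 2 / 12)
    return mean, sd


mean, sd = uniform_mean_sd(0, 23)
print(f"mean = {mean:.2f} seconds, sd = {sd:.2f} seconds")
```

The printed values can be compared with the sample mean and sample standard deviation reported for the smiling times.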


Example: 
Exercise: 


Problem: 


What is the probability that a randomly chosen eight-week old baby smiles between 2 and 18 
seconds? 


Solution: 
Find $P(2 < x < 18)$:

$P(2 < x < 18) = (\text{base})(\text{height}) = (18 - 2) \cdot \frac{1}{23} = \frac{16}{23}$.


Exercise: 


Problem: Find the 90th percentile for an eight week old baby's smiling time. 

Solution: 

Ninety percent of the smiling times fall below the 90th percentile, k, so $P(x < k) = 0.90$

$(\text{base})(\text{height}) = 0.90$

$(k - 0) \cdot \frac{1}{23} = 0.90$

$k = 23 \cdot 0.90 = 20.7$

[Figure: graph of f(x) with the area to the left of k shaded; AREA = $P(X < k) = 0.90$.]


Exercise: 


Problem: 


Find the probability that a random eight week old baby smiles more than 12 seconds KNOWING 
that the baby smiles MORE THAN 8 SECONDS. 


Solution: 


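Find $P(x > 12 \mid x > 8)$. There are two ways to do the problem. For the first way, use the fact that
this is a conditional and changes the sample space. The graph illustrates the new sample space. You
already know the baby smiled more than 8 seconds.

Write a new f(x): $f(x) = \frac{1}{23 - 8} = \frac{1}{15}$ for $8 < x < 23$

$P(x > 12 \mid x > 8) = (23 - 12) \cdot \frac{1}{15} = \frac{11}{15}$

[Figure: graph of the new f(x) with x = 0, 8, 12, and 23 marked on the x-axis.]

For the second way, use the conditional formula from Probability Topics with the original
distribution $X \sim U(0, 23)$:

$P(A \mid B) = \frac{P(A \text{ AND } B)}{P(B)}$. For this problem, A is $(x > 12)$ and B is $(x > 8)$.

So, $P(x > 12 \mid x > 8) = \frac{P(x > 12 \text{ AND } x > 8)}{P(x > 8)} = \frac{P(x > 12)}{P(x > 8)} = \frac{11/23}{15/23} = 0.733$

[Figure: graph of the original distribution with the x-axis scaled from 0 to 24.]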
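The two approaches to the conditional probability can be compared side by side. The sketch below is one possible Python check, using exact fractions from the standard library; the variable names are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction

a, b = 0, 23  # X ~ U(0, 23), smiling time in seconds

# First way: reduce the sample space to 8 < x < 23 and use the new height.
new_height = Fraction(1, b - 8)
first_way = (b - 12) * new_height

# Second way: P(A AND B) / P(B) with the original height 1/23.
# Here A is (x > 12) and B is (x > 8), so "A AND B" is just (x > 12).
height = Fraction(1, b - a)
p_a_and_b = (b - 12) * height
p_b = (b - 8) * height
second_way = p_a_and_b / p_b

print(first_way, second_way, round(float(first_way), 3))
```

Both routes give the same fraction, which is the point of the example: the conditional formula and the reduced sample space describe the same area ratio.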


Example: 

Uniform: The amount of time, in minutes, that a person must wait for a bus is uniformly distributed 
between 0 and 15 minutes, inclusive. 

Exercise: 


Problem: What is the probability that a person waits fewer than 12.5 minutes? 


Solution: 


Let X = the number of minutes a person must wait for a bus. a = 0 and b = 15. $X \sim U(0, 15)$. Write
the probability density function. $f(x) = \frac{1}{15 - 0} = \frac{1}{15}$ for $0 \le x \le 15$.

Find $P(x < 12.5)$. Draw a graph.

$P(x < k) = (\text{base})(\text{height}) = (12.5 - 0) \cdot \frac{1}{15} = 0.8333$

The probability a person waits less than 12.5 minutes is 0.8333.


Exercise: 


Problem: On the average, how long must a person wait? 


Find the mean, $\mu$, and the standard deviation, $\sigma$.


Solution: 


$\mu = \frac{a+b}{2} = \frac{15+0}{2} = 7.5$. On the average, a person must wait 7.5 minutes.

$\sigma = \sqrt{\frac{(b-a)^2}{12}} = \sqrt{\frac{(15-0)^2}{12}} = 4.3$. The standard deviation is 4.3 minutes.


Exercise: 


Problem: Ninety percent of the time, the time a person must wait falls below what value? 


Note: This asks for the 90th percentile. 


Solution: 


Find the 90th percentile. Draw a graph. Let k = the 90th percentile. 


$P(x < k) = (\text{base})(\text{height}) = (k - 0) \cdot \left(\frac{1}{15}\right)$

$0.90 = k \cdot \frac{1}{15}$

$k = (0.90)(15) = 13.5$
k is sometimes called a critical value. 


The 90th percentile is 13.5 minutes. Ninety percent of the time, a person must wait at most 13.5 
minutes. 


[Figure: graph of f(x) with height $\frac{1}{15}$ and the area to the left of k shaded; AREA = $P(x < k) = 0.90$.]
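The bus example uses the same rectangle in two directions: from a value to an area, and from an area back to a value. A minimal Python sketch of both directions follows, assuming $X \sim U(0, 15)$; `uniform_cdf` and `uniform_percentile` are hypothetical helper names introduced here.

```python
def uniform_cdf(x, a, b):
    """Area to the left of x for X ~ U(a, b)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def uniform_percentile(area_to_left, a, b):
    """Value k with P(X < k) = area_to_left, found by solving (k - a)/(b - a) = area."""
    return a + area_to_left * (b - a)


a, b = 0, 15
p_wait = uniform_cdf(12.5, a, b)
k90 = uniform_percentile(0.90, a, b)
print(round(p_wait, 4), round(k90, 2))
```

Feeding the percentile back into `uniform_cdf` should return the original area, which is a convenient way to confirm a critical value.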
Example: 


Uniform: Suppose the time it takes a nine-year old to eat a donut is between 0.5 and 4 minutes, inclusive. 
Let X = the time, in minutes, it takes a nine-year old child to eat a donut. Then X ~ U(0.5, 4). 
Exercise: 


Problem: 


The probability that a randomly selected nine-year old child eats a donut in at least two minutes is _______.


Solution: 


0.5714 
Exercise: 
Problem: 


Find the probability that a different nine-year old child eats a donut in more than 2 minutes given that 
the child has already been eating the donut for more than 1.5 minutes. 


The second probability question has a conditional (refer to "Probability Topics"). You are asked to 
find the probability that a nine-year old child eats a donut in more than 2 minutes given that the child 
has already been eating the donut for more than 1.5 minutes. Solve the problem two different ways 
(see the first example). You must reduce the sample space. First way: Since you already know the 
child has already been eating the donut for more than 1.5 minutes, you are no longer starting at 

a = 0.5 minutes. Your starting point is 1.5 minutes. 


Write a new f(x):

$f(x) = \frac{1}{4 - 1.5} = \frac{2}{5}$ for $1.5 \le x \le 4$.

Find $P(x > 2 \mid x > 1.5)$. Draw a graph.

$P(x > 2 \mid x > 1.5) = (\text{base})(\text{new height}) = (4 - 2)(2/5) = ?$


Solution: 


$\frac{4}{5}$

The probability that a nine-year old child eats a donut in more than 2 minutes given that the child has
already been eating the donut for more than 1.5 minutes is $\frac{4}{5}$.

Second way: Draw the original graph for $X \sim U(0.5, 4)$. Use the conditional formula

$P(x > 2 \mid x > 1.5) = \frac{P(x > 2 \text{ AND } x > 1.5)}{P(x > 1.5)} = \frac{P(x > 2)}{P(x > 1.5)} = \frac{2/3.5}{2.5/3.5} = 0.8 = \frac{4}{5}$


Note: See "Summary of the Uniform and Exponential Probability Distributions" for a full summary. 


Example: 
Uniform: Ace Heating and Air Conditioning Service finds that the amount of time a repairman needs to
fix a furnace is uniformly distributed between 1.5 and 4 hours. Let x = the time needed to fix a furnace.
Then $x \sim U(1.5, 4)$.


1. Find the probability that a randomly selected furnace repair requires more than 2 hours.

2. Find the probability that a randomly selected furnace repair requires less than 3 hours.

3. Find the 30th percentile of furnace repair times.

4. The longest 25% of furnace repairs take at least how long? (In other words: Find the
minimum time for the longest 25% of repair times.) What percentile does this represent?

5. Find the mean and standard deviation.


Exercise: 


Problem: Find the probability that a randomly selected furnace repair requires longer than 2 hours. 
Solution: 


To find f(x): $f(x) = \frac{1}{4 - 1.5} = \frac{1}{2.5}$ so $f(x) = 0.4$


P(x>2) = (base)(height) = (4 — 2)(0.4) = 0.8 


Example 4 Figure 1: Uniform Distribution between 1.5 and 4 with shaded area between 2 and 4
representing the probability that the repair time x is greater than 2. The height of f(x) is 0.4
and the shaded area is P(x > 2).


Exercise: 


Problem: 


Find the probability that a randomly selected furnace repair requires less than 3 hours. Describe how 
the graph differs from the graph in the first part of this example. 


Solution: 
P(x < 3) = (base)(height) = (3 - 1.5)(0.4) = 0.6
The graph of the rectangle showing the entire distribution would remain the same. However the 


graph should be shaded between x=1.5 and x=3. Note that the shaded area starts at x=1.5 rather than 
at x=0; since X~U(1.5,4), x can not be less than 1.5. 


Example 4 Figure 2: Uniform Distribution between 1.5 and 4 with shaded area between 1.5 and 3
representing the probability that the repair time x is less than 3. The height of f(x) is 0.4
and the shaded area is P(x < 3).


Exercise: 


Problem: Find the 30th percentile of furnace repair times. 


Solution: 


Example 4 Figure 3: Uniform Distribution between 1.5 and 4 with an area of 0.30 shaded to the left,
representing the shortest 30% of repair times. The height of f(x) is 0.4 and the shaded
area is P(X < k) = 0.3.


P(x < k) = 0.30
P(x < k) = (base)(height) = (k - 1.5)(0.4)

• 0.3 = (k - 1.5)(0.4); Solve to find k:
• 0.75 = k - 1.5, obtained by dividing both sides by 0.4
• k = 2.25, obtained by adding 1.5 to both sides

The 30th percentile of repair times is 2.25 hours. 30% of repair times are 2.25 hours or less.


Exercise: 


Problem: 


The longest 25% of furnace repair times take at least how long? (Find the minimum time for the 
longest 25% of repairs.) 


Solution: 
Example 4 Figure 4: Uniform Distribution between 1.5 and 4 with an area of 0.25 shaded to the right,
representing the longest 25% of repair times. The height of f(x) is 0.4 and the shaded
area is P(X > k) = 0.25.


P(x > k) = 0.25
P(x > k) = (base)(height) = (4 - k)(0.4)

• 0.25 = (4 - k)(0.4); Solve for k:
• 0.625 = 4 - k, obtained by dividing both sides by 0.4
• -3.375 = -k, obtained by subtracting 4 from both sides
• k = 3.375


The longest 25% of furnace repairs take at least 3.375 hours (3.375 hours or longer). 


Note: Since 25% of repair times are 3.375 hours or longer, that means that 75% of repair times are 
3.375 hours or less. 3.375 hours is the 75th percentile of furnace repair times. 


Exercise: 


Problem: Find the mean and standard deviation 


Solution: 
$\mu = \frac{a+b}{2}$ and $\sigma = \sqrt{\frac{(b-a)^2}{12}}$

$\mu = \frac{1.5+4}{2} = 2.75$ hours and $\sigma = \sqrt{\frac{(4-1.5)^2}{12}} = 0.7217$ hours
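The "longest 25%" question works from the right-hand side of the rectangle. The Python sketch below, written under the assumption $X \sim U(1.5, 4)$ from this example, solves for the cutoff and then confirms that it is also the 75th percentile; `upper_tail_cutoff` is a hypothetical name used only here.

```python
a, b = 1.5, 4.0
height = 1 / (b - a)  # 0.4 for the furnace repair times


def upper_tail_cutoff(tail_area):
    """Value k with P(X > k) = tail_area, from (b - k) * height = tail_area."""
    return b - tail_area / height


k = upper_tail_cutoff(0.25)
area_to_left = (k - a) * height  # should be 1 - 0.25

print(round(k, 3), round(area_to_left, 2))
```

Checking the area to the left of the cutoff is a simple way to answer "What percentile does this represent?" without a second derivation.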


Note: See "Summary of the Uniform and Exponential Probability Distributions" for a full summary. 


**Example 5 contributed by Roberta Bloom 


Glossary 


Conditional Probability 
The likelihood that an event will occur given that another event has already occurred. 


Uniform Distribution 
A continuous random variable (RV) that has equally likely outcomes over the domain, $a < x < b$.
Often referred to as the Rectangular distribution because the graph of the pdf has the form of a
rectangle. Notation: $X \sim U(a,b)$. The mean is $\mu = \frac{a+b}{2}$ and the standard deviation is
$\sigma = \sqrt{\frac{(b-a)^2}{12}}$. The probability density function is $f(x) = \frac{1}{b-a}$ for
$a < x < b$ or $a \le x \le b$. The cumulative distribution is $P(X \le x) = \frac{x-a}{b-a}$.


The Exponential Distribution 
This module introduces the properties of the exponential distribution, the behavior of probabilities that reflect a 
large number of small values and a small number of high values. 


The exponential distribution is often concerned with the amount of time until some specific event occurs. For 
example, the amount of time (beginning now) until an earthquake occurs has an exponential distribution. Other 
examples include the length, in minutes, of long distance business telephone calls, and the amount of time, in 
months, a car battery lasts. It can be shown, too, that the value of the change that you have in your pocket or 
purse approximately follows an exponential distribution. 


Values for an exponential random variable occur in the following way. There are fewer large values and more 
small values. For example, the amount of money customers spend in one trip to the supermarket follows an 
exponential distribution. There are more people that spend less money and fewer people that spend large 
amounts of money. 


The exponential distribution is widely used in the field of reliability. Reliability deals with the amount of time a 
product lasts. 


Example: 

Illustrates the exponential distribution: Let X = amount of time (in minutes) a postal clerk spends with 
his/her customer. The time is known to have an exponential distribution with the average amount of time equal 
to 4 minutes. 

X is a continuous random variable since time is measured. It is given that $\mu = 4$ minutes. To do any
calculations, you must know m, the decay parameter.

$m = \frac{1}{\mu}$. Therefore, $m = \frac{1}{4} = 0.25$

The standard deviation, $\sigma$, is the same as the mean. $\mu = \sigma$

The distribution notation is $X \sim Exp(m)$. Therefore, $X \sim Exp(0.25)$.

The probability density function is $f(x) = m \cdot e^{-m \cdot x}$. The number e = 2.71828182846... It is a number that is
used often in mathematics. Scientific calculators have the key "$e^x$." If you enter 1 for x, the calculator will
display the value e.

The curve is:

$f(x) = 0.25 \cdot e^{-0.25 \cdot x}$ where x is at least 0 and m = 0.25.

For example, $f(5) = 0.25 \cdot e^{-0.25 \cdot 5} = 0.072$


The graph is as follows:

[Figure: graph of $f(x) = 0.25 \cdot e^{-0.25 \cdot x}$ for x from 0 to 20, starting at height m = 0.25, with the mean $\mu = 4$ marked on the x-axis.]

Notice the graph is a declining curve. When x = 0,
$f(x) = 0.25 \cdot e^{-0.25 \cdot 0} = 0.25 \cdot 1 = 0.25 = m$
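The density values quoted for the postal clerk example can be reproduced with a few lines of Python. This is a minimal sketch assuming $X \sim Exp(0.25)$; the function name `exp_pdf` is an illustrative choice, not something defined in the text.

```python
import math


def exp_pdf(x, m):
    """Exponential density f(x) = m * e^(-m * x) for x >= 0, and 0 otherwise."""
    if x < 0:
        return 0.0
    return m * math.exp(-m * x)


m = 0.25  # decay parameter, m = 1 / mean = 1 / 4
print(round(exp_pdf(0, m), 3))  # height at x = 0 equals m
print(round(exp_pdf(5, m), 3))  # height at x = 5
print(exp_pdf(2, m) > exp_pdf(6, m))  # the curve declines as x grows
```

Note that these are heights of the curve, not probabilities; probabilities come from areas under it.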


Example: 
Exercise: 


Problem: Find the probability that a clerk spends four to five minutes with a randomly selected customer. 
Solution: 

Find $P(4 < x < 5)$.

The cumulative distribution function (CDF) gives the area to the left.

$P(X < x) = 1 - e^{-m \cdot x}$

$P(x < 5) = 1 - e^{-0.25 \cdot 5} = 0.7135$ and $P(x < 4) = 1 - e^{-0.25 \cdot 4} = 0.6321$

[Figure: graph of f(x), starting at height 0.25, with the area for $P(4 < x < 5)$ shaded.]

Note: You can do these calculations easily on a calculator.

The probability that a postal clerk spends four to five minutes with a randomly selected customer is

$P(4 < x < 5) = P(x < 5) - P(x < 4) = 0.7135 - 0.6321 = 0.0814$

Note: TI-83+ and TI-84: On the home screen, enter (1-e^(-.25*5))-(1-e^(-.25*4)) or enter
e^(-.25*4)-e^(-.25*5).
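For readers working on a computer instead of a TI calculator, the same subtraction of two left-hand areas might look like the following Python sketch. It assumes $X \sim Exp(0.25)$ as in the example, and `exp_cdf` is a hypothetical helper name.

```python
import math


def exp_cdf(x, m):
    """Area to the left of x for X ~ Exp(m): P(X < x) = 1 - e^(-m * x)."""
    if x <= 0:
        return 0.0
    return 1 - math.exp(-m * x)


m = 0.25
p_less_5 = exp_cdf(5, m)
p_less_4 = exp_cdf(4, m)
p_between = p_less_5 - p_less_4  # P(4 < x < 5)

print(round(p_less_5, 4), round(p_less_4, 4), round(p_between, 4))
```

The third value is the shaded area between 4 and 5 minutes.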


Exercise: 


Problem: Half of all customers are finished within how long? (Find the 50th percentile) 


Solution: 


Find the 50th percentile. 


[Figure: graph of f(x) with the area to the left of k shaded; $P(x < k) = 0.50$.]

$P(x < k) = 0.50$, k = 2.8 minutes (calculator or computer)
Half of all customers are finished within 2.8 minutes.

You can also do the calculation as follows:

$P(x < k) = 0.50$ and $P(x < k) = 1 - e^{-0.25 \cdot k}$

Therefore, $0.50 = 1 - e^{-0.25 \cdot k}$ and $e^{-0.25 \cdot k} = 1 - 0.50 = 0.5$

Take natural logs: $\ln(e^{-0.25 \cdot k}) = \ln(0.50)$. So, $-0.25 \cdot k = \ln(0.50)$

Solve for k: $k = \frac{\ln(0.50)}{-0.25} = 2.8$ minutes

Note: A formula for the percentile k is $k = \frac{LN(1 - AreaToTheLeft)}{-m}$ where LN is the natural log.

Note: TI-83+ and TI-84: On the home screen, enter LN(1-.50)/-.25. Press the (-) for the negative.
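The percentile formula translates directly into code. The Python sketch below assumes $X \sim Exp(0.25)$ and uses a hypothetical helper `exp_percentile`; it also compares the median with the mean of 4 minutes.

```python
import math


def exp_percentile(area_to_left, m):
    """Value k with P(X < k) = area_to_left, using k = LN(1 - area) / (-m)."""
    return math.log(1 - area_to_left) / (-m)


m = 0.25
median = exp_percentile(0.50, m)
mean = 1 / m

print(round(median, 1), mean, median < mean)
```

Because an exponential distribution has many small values and a few large ones, the median falls below the mean.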


Exercise: 


Problem: Which is larger, the mean or the median? 
Solution: 
Is the mean or median larger? 


From part b, the median or 50th percentile is 2.8 minutes. The theoretical mean is 4 minutes. The mean is 
larger. 


Optional Collaborative Classroom Activity 


Have each class member count the change he/she has in his/her pocket or purse. Your instructor will record the 
amounts in dollars and cents. Construct a histogram of the data taken by the class. Use 5 intervals. Draw a 
smooth curve through the bars. The graph should look approximately exponential. Then calculate the mean. 


Let X = the amount of money a student in your class has in his/her pocket or purse. 


The distribution for X is approximately exponential with mean, $\mu$ = _______ and m = _______. The standard
deviation, $\sigma$ = _______.


Draw the appropriate exponential graph. You should label the x and y axes, the decay rate, and the mean. Shade 


the area that represents the probability that one student has less than $.40 in his/her pocket or purse. (Shade 
P(x < 0.40)). 


Example: 


On the average, a certain computer part lasts 10 years. The length of time the computer part lasts is 
exponentially distributed. 
Exercise: 


Problem: What is the probability that a computer part lasts more than 7 years? 
Solution: 

Let x = the amount of time (in years) a computer part lasts. 

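$\mu = 10$ so $m = \frac{1}{\mu} = \frac{1}{10} = 0.1$

Find $P(x > 7)$. Draw a graph.

$P(x > 7) = 1 - P(x < 7)$.

Since $P(X < x) = 1 - e^{-mx}$ then $P(X > x) = 1 - (1 - e^{-mx}) = e^{-mx}$

$P(x > 7) = e^{-0.1 \cdot 7} = 0.4966$. The probability that a computer part lasts more than 7 years is 0.4966.

Note: TI-83+ and TI-84: On the home screen, enter e^(-.1*7).

[Figure: graph of f(x), starting at height 0.1, with the area for $P(x > 7)$ shaded to the right of x = 7 and the mean $\mu = 10$ marked.]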
Exercise: 


Problem: On the average, how long would 5 computer parts last if they are used one after another? 


Solution: 


On the average, 1 computer part lasts 10 years. Therefore, 5 computer parts, if they are used one right after 
the other would last, on the average, 


(5)(10) = 50 years. 

Exercise: 
Problem: Eighty percent of computer parts last at most how long? 
Solution: 


Find the 80th percentile. Draw a graph. Let k = the 80th percentile.

[Figure: graph of f(x), starting at height 0.1, with the area to the left of k shaded; $P(x < k) = 0.80$.]

Solve for k: $k = \frac{\ln(1 - 0.80)}{-0.1} = 16.1$ years


Eighty percent of the computer parts last at most 16.1 years. 


Note: TI-83+ and TI-84: On the home screen, enter LN(1 - .80)/-.1 


Exercise: 


Problem: What is the probability that a computer part lasts between 9 and 11 years? 
Solution: 


Find $P(9 < x < 11)$. Draw a graph.

[Figure: graph of f(x), starting at height 0.1, with the area for $P(9 < x < 11)$ shaded.]

$P(9 < x < 11) = P(x < 11) - P(x < 9) = (1 - e^{-0.1 \cdot 11}) - (1 - e^{-0.1 \cdot 9}) = 0.6671 - 0.5934 = 0.0737$
(calculator or computer)

The probability that a computer part lasts between 9 and 11 years is 0.0737.

Note: TI-83+ and TI-84: On the home screen, enter e^(-.1*9) - e^(-.1*11).
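One way to build intuition for these areas is to simulate many lifetimes and count how often each event happens. The following Python sketch is an illustration only: it assumes lifetimes follow $Exp(0.1)$ as in the example, uses the standard library's `random.expovariate`, and picks an arbitrary seed and sample size.

```python
import math
import random

m = 0.1  # decay parameter for a mean lifetime of 10 years

# Exact areas from the formulas in the example.
exact_more_than_7 = math.exp(-m * 7)
exact_9_to_11 = math.exp(-m * 9) - math.exp(-m * 11)

# Simulated lifetimes; the seed is fixed only so the run is repeatable.
rng = random.Random(2024)
lifetimes = [rng.expovariate(m) for _ in range(100_000)]

sim_more_than_7 = sum(t > 7 for t in lifetimes) / len(lifetimes)
sim_9_to_11 = sum(9 < t < 11 for t in lifetimes) / len(lifetimes)

print(f"P(x > 7): exact {exact_more_than_7:.4f}, simulated {sim_more_than_7:.4f}")
print(f"P(9 < x < 11): exact {exact_9_to_11:.4f}, simulated {sim_9_to_11:.4f}")
```

The simulated proportions will not match the exact areas digit for digit, but with this many lifetimes they should land close to them.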


Example: 
Suppose that the length of a phone call, in minutes, is an exponential random variable with decay parameter =
$\frac{1}{12}$. If another person arrives at a public telephone just before you, find the probability that you will have to
wait more than 5 minutes. Let X = the length of a phone call, in minutes.
Exercise:


Problem: What is m, $\mu$, and $\sigma$? The probability that you must wait more than 5 minutes is _______.


Solution:


$m = \frac{1}{12}$, $\mu = 12$, $\sigma = 12$

$P(x > 5) = 0.6592$


Note: A summary for exponential distribution is available in "Summary of The Uniform and Exponential
Probability Distributions".


Glossary 


Exponential Distribution 
A continuous random variable (RV) that appears when we are interested in the intervals of time between
some random events, for example, the length of time between emergency arrivals at a hospital. Notation:
$X \sim Exp(m)$. The mean is $\mu = \frac{1}{m}$ and the standard deviation is $\sigma = \frac{1}{m}$. The
probability density function is $f(x) = m e^{-mx}$, $x \ge 0$, and the cumulative distribution function is
$P(X \le x) = 1 - e^{-mx}$.


Summary of the Uniform and Exponential Probability Distributions 

This module provides a summary of formulas and definitions related to Continuous Random 
Variables. 

Formula 

Uniform 


X = a real number between a and b (in some instances, X can take on the values a and b). a =
smallest X; b = largest X

$X \sim U(a,b)$

The mean is $\mu = \frac{a+b}{2}$

The standard deviation is $\sigma = \sqrt{\frac{(b-a)^2}{12}}$

Probability density function: $f(X) = \frac{1}{b-a}$ for $a \le X \le b$

Area to the Left of x: $P(X < x) = (\text{base})(\text{height})$
Area to the Right of x: $P(X > x) = (\text{base})(\text{height})$

Area Between c and d: $P(c < X < d) = (\text{base})(\text{height}) = (d - c)(\text{height})$.
Formula 
Exponential 


$X \sim Exp(m)$

X = a real number, 0 or larger. m = the parameter that controls the rate of decay or decline

The mean and standard deviation are the same.

$\mu = \sigma = \frac{1}{m}$ and $m = \frac{1}{\mu} = \frac{1}{\sigma}$

The probability density function: $f(x) = m \cdot e^{-m \cdot x}$, $X \ge 0$

Area to the Left of x: $P(X < x) = 1 - e^{-m \cdot x}$

Area to the Right of x: $P(X > x) = e^{-m \cdot x}$

Area Between c and d:
$P(c < X < d) = P(X < d) - P(X < c) = (1 - e^{-m \cdot d}) - (1 - e^{-m \cdot c}) = e^{-m \cdot c} - e^{-m \cdot d}$

Percentile, k: $k = \frac{LN(1 - AreaToTheLeft)}{-m}$


Practice 1: Uniform Distribution 
In this module the student will explore the properties of data with a uniform 
distribution. 


Student Learning Outcomes 


• The student will analyze data following a uniform distribution.


Given 


The age of cars in the staff parking lot of a suburban college is uniformly 
distributed from six months (0.5 years) to 9.5 years. 


Describe the Data 


Exercise: 


Problem: What is being measured here? 


Solution: 


The age of cars in the staff parking lot 


Exercise: 


Problem: In words, define the Random Variable X. 


Solution: 


X = The age (in years) of cars in the staff parking lot 


Exercise: 


Problem: Are the data discrete or continuous? 


Solution: 


Continuous 


Exercise: 


Problem: The interval of values for x is:


Solution: 


0.5-9.5 


Exercise: 


Problem: The distribution for X is: 


Solution: 


X ~U(0.5,9.5) 


Probability Distribution 


Exercise: 


Problem: Write the probability density function. 


Solution:


$f(x) = \frac{1}{9}$ for $0.5 \le x \le 9.5$


Exercise: 


Problem: Graph the probability distribution. 


• a Sketch the graph of the probability distribution.
• b Identify the following values:
  ◦ i Lowest value for x: _______
  ◦ ii Highest value for x: _______
  ◦ iii Height of the rectangle: _______
  ◦ iv Label for x-axis (words): _______
  ◦ v Label for y-axis (words): _______


Solution: 


• b.i 0.5
• b.ii 9.5
• b.iii $\frac{1}{9}$
• b.iv Age of Cars
• b.v f(x)


Random Probability 


Exercise: 


Problem: 


Find the probability that a randomly chosen car in the lot was less than 
4 years old. 


• a Sketch the graph. Shade the area of interest.
• b Find the probability. $P(x < 4)$ = _______


Solution:

• b $\frac{3.5}{9}$
Exercise: 
Problem: 


Out of just the cars less than 7.5 years old, find the probability that a 
randomly chosen car in the lot was less than 4 years old. 


• a Sketch the graph. Shade the area of interest.
• b Find the probability. $P(x < 4 \mid x < 7.5)$ = _______


Solution:

• b $\frac{3.5}{7}$


Exercise: 
Discussion Question 


Problem: 
What has changed in the previous two problems that made the 
solutions different? 

Quartiles 


Exercise: 


Problem: Find the average age of the cars in the lot. 
Solution: 
$\mu = 5$

Exercise: 


Problem: 


Find the third quartile of ages of cars in the lot. This means you will
have to find the value such that $\frac{3}{4}$, or 75%, of the cars are at most (less
than or equal to) that age.

• a Sketch the graph. Shade the area of interest.
• b Find the value k such that $P(x < k) = 0.75$.
• c The third quartile is: _______


Solution:

• b k = 7.25


Practice 2: Exponential Distribution 
In this module the student will explore the properties of data with an 
exponential distribution. 


Student Learning Outcomes 


• The student will analyze data following the exponential distribution.


Given 
Carbon-14 is a radioactive element with a half-life of about 5730 years. 
Carbon-14 is said to decay exponentially. The decay rate is 0.000121 . We 


start with 1 gram of carbon-14. We are interested in the time (years) it takes 
to decay carbon-14. 


Describe the Data 


Exercise: 


Problem: What is being measured here? 


Exercise: 


Problem: Are the data discrete or continuous? 


Solution: 


Continuous 


Exercise: 


Problem: In words, define the Random Variable X. 


Solution: 


X = Time (years) to decay carbon-14 


Exercise: 


Problem: What is the decay rate (m)? 


Solution: 


m = 0.000121 


Exercise: 


Problem: The distribution for X is: 


Solution: 


X ~ Exp(0.000121) 


Probability 


Exercise: 


Problem: 


Find the amount (percent of 1 gram) of carbon-14 lasting less than 
5730 years. This means, find P(x < 5730). 


• a Sketch the graph. Shade the area of interest.
• b Find the probability. $P(x < 5730)$ = _______


Solution:

• b $P(x < 5730) = 0.5001$
Exercise: 
Problem: 
Find the percentage of carbon-14 lasting longer than 10,000 years. 


• a Sketch the graph. Shade the area of interest.
• b Find the probability. $P(x > 10000)$ = _______


Solution:

• b $P(x > 10000) = 0.2982$
Exercise: 
Problem: 
Thirty percent (30%) of carbon-14 will decay within how many years? 


• a Sketch the graph. Shade the area of interest.
• b Find the value k such that $P(x < k) = 0.30$.


Solution:

• b k = 2947.73
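The three carbon-14 answers can be verified with the exponential formulas from the summary. The Python sketch below assumes the decay rate m = 0.000121 given in the problem; the variable names are illustrative.

```python
import math

m = 0.000121  # decay rate for carbon-14, per year, as given

p_less_5730 = 1 - math.exp(-m * 5730)  # area to the left of 5730 years
p_more_10000 = math.exp(-m * 10000)    # area to the right of 10,000 years
k_30 = math.log(1 - 0.30) / (-m)       # 30th percentile, in years

print(round(p_less_5730, 4), round(p_more_10000, 4), round(k_30, 2))
```

The first value is close to one half, which is what the stated half-life of about 5730 years suggests.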


Homework 
This module provides a number of homework exercises related to 
Continuous Random Variables. 


For each probability and percentile problem, DRAW THE PICTURE! 
Exercise: 


Problem: 


Consider the following experiment. You are one of 100 people enlisted 
to take part in a study to determine the percent of nurses in America 
with an R.N. (registered nurse) degree. You ask nurses if they have an 
R.N. degree. The nurses answer “yes” or “no.” You then calculate the 
percentage of nurses with an R.N. degree. You give that percentage to 
your supervisor. 


• a What part of the experiment will yield discrete data?
• b What part of the experiment will yield continuous data?


Exercise: 


Problem: 


When age is rounded to the nearest year, do the data stay continuous, 
or do they become discrete? Why? 


Exercise: 


Problem: 


Births are approximately uniformly distributed between the 52 weeks
of the year. They can be said to follow a Uniform Distribution from 1 to
53 (spread of 52 weeks).

• a $X \sim$ _______
• b Graph the probability distribution.
• c $f(x)$ = _______
• d $\mu$ = _______
• e $\sigma$ = _______
• f Find the probability that a person is born at the exact moment
  week 19 starts. That is, find $P(x = 19)$ = _______
• g $P(2 < x < 31)$ = _______
• h Find the probability that a person is born after week 40.
• i $P(12 < x \mid x < 28)$ = _______
• j Find the 70th percentile.
• k Find the minimum for the upper quarter.


Solution:

• a $X \sim U(1, 53)$
• c $f(x) = \frac{1}{52}$ where $1 \le x \le 53$


Exercise: 


Problem: 


A random number generator picks a number from 1 to 9 in a uniform 
manner. 


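• a $X \sim$ _______
• b Graph the probability distribution.
• c $f(x)$ = _______
• d $\mu$ = _______
• e $\sigma$ = _______
• f $P(3.5 < x < 7.25)$ = _______
• g $P(x > 5.67)$ = _______
• h $P(x > 5 \mid x > 3)$ = _______
• i Find the 90th percentile.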
Exercise: 
Problem: 


The time (in minutes) until the next bus departs a major bus depot
follows a distribution with $f(x) = \frac{1}{20}$ where x goes from 25 to 45
minutes.

• a Define the random variable. X = _______
• b $X \sim$ _______
• c Graph the probability distribution.
• d The distribution is _______ (name of distribution). It is
  _______ (discrete or continuous).
• e $\mu$ = _______
• f $\sigma$ = _______
• g Find the probability that the time is at most 30 minutes. Sketch
  and label a graph of the distribution. Shade the area of interest.
  Write the answer in a probability statement.
• h Find the probability that the time is between 30 and 40 minutes.
  Sketch and label a graph of the distribution. Shade the area of
  interest. Write the answer in a probability statement.
• i $P(25 < x < 55)$ = _______. State this in a probability
  statement (similar to g and h), draw the picture, and find the
  probability.
• j Find the 90th percentile. This means that 90% of the time, the
  time is less than _______ minutes.
• k Find the 75th percentile. In a complete sentence, state what this
  means. (See j.)
• l Find the probability that the time is more than 40 minutes given
  (or knowing that) it is at least 30 minutes.


Solution: 


• b $X \sim U(25, 45)$
• d uniform; continuous
• e 35 minutes
• f 5.8 minutes
• g 0.25
• h 0.5
• i 1
• j 43 minutes
• k 40 minutes
• l 0.3333


Exercise: 


Problem: 


According to a study by Dr. John McDougall of his live-in weight loss 
program at St. Helena Hospital, the people who follow his program 
lose between 6 and 15 pounds a month until they approach trim body 
weight. Let’s suppose that the weight loss is uniformly distributed. We 
are interested in the weight loss of a randomly selected individual 
following the program for one month. (Source: The McDougall 
Program for Maximum Weight Loss by John A. McDougall, M.D.) 


• a Define the random variable. X = _______
• b $X \sim$ _______
• c Graph the probability distribution.
• d $f(x)$ = _______
• e $\mu$ = _______
• f $\sigma$ = _______
• g Find the probability that the individual lost more than 10
  pounds in a month.
• h Suppose it is known that the individual lost more than 10
  pounds in a month. Find the probability that he lost less than 12
  pounds in the month.


ae ed i caer a a ra . State this ina 
probability question (similar to g and h), draw the picture, and 
find the probability. 


Exercise: 


Problem: 


A subway train on the Red Line arrives every 8 minutes during rush 
hour. We are interested in the length of time a commuter must wait for 
a train to arrive. The time follows a uniform distribution. 


• a Define the random variable. X = _______
• b $X \sim$ _______
• c Graph the probability distribution.
• d $f(x)$ = _______
• g Find the probability that the commuter waits less than one
  minute.
• h Find the probability that the commuter waits between three and
  four minutes.
• i 60% of commuters wait more than how long for the train? State
  this in a probability question (similar to g and h), draw the
  picture, and find the probability.


Solution:

• d $f(x) = \frac{1}{8}$ where $0 \le x \le 8$


Exercise: 


Problem: 


The age of a first grader on September 1 at Garden Elementary School 
is uniformly distributed from 5.8 to 6.8 years. We randomly select one 
first grader from the class. 


• a Define the random variable. X = _______
• b $X \sim$ _______
• c Graph the probability distribution.
• d $f(x)$ = _______
• g Find the probability that she is over 6.5 years.
• h Find the probability that she is between 4 and 6 years.
• i Find the 70th percentile for the age of first graders on September
  1 at Garden Elementary School.


Exercise: 


Problem: Let X ~Exp(0.1) 


• a decay rate = _______
• b $\mu$ = _______
• c Graph the probability distribution function.
• d On the above graph, shade the area corresponding to $P(x < 6)$
  and find the probability.
• e Sketch a new graph, shade the area corresponding to
  $P(3 < x < 6)$ and find the probability.
• f Sketch a new graph, shade the area corresponding to $P(x > 7)$
  and find the probability.
• g Sketch a new graph, shade the area corresponding to the 40th
  percentile and find the value.
• h Find the average value of x.


Solution: 


• a 0.1
• b 10
• d 0.4512
• e 0.1920
• f 0.4966
• g 5.11
• h 10


Exercise: 


Problem: 


Suppose that the length of long distance phone calls, measured in 
minutes, is known to have an exponential distribution with the average 
length of a call equal to 8 minutes. 


• a Define the random variable. X = _______
• b Is X continuous or discrete?
• c $X \sim$ _______
• d $\mu$ = _______
• e $\sigma$ = _______
• f Draw a graph of the probability distribution. Label the axes.
• g Find the probability that a phone call lasts less than 9 minutes.
• h Find the probability that a phone call lasts more than 9 minutes.
• i Find the probability that a phone call lasts between 7 and 9
  minutes.
• j If 25 phone calls are made one after another, on average, what
  would you expect the total to be? Why?


Exercise: 
Problem: 


Suppose that the useful life of a particular car battery, measured in 
months, decays with parameter 0.025. We are interested in the life of 
the battery. 


• a Define the random variable. X = _______
• b Is X continuous or discrete?
• c $X \sim$ _______
• d On average, how long would you expect 1 car battery to last?
• e On average, how long would you expect 9 car batteries to last, if
  they are used one after another?
• f Find the probability that a car battery lasts more than 36 months.
• g 70% of the batteries last at least how long?


Solution: 


• c X ~ Exp(0.025)
• d 40 months

• e 360 months
• f 0.4066

• g 14.27


Exercise: 


Problem: 


The percent of persons (ages 5 and older) in each state who speak a 
language at home other than English is approximately exponentially 
distributed with a mean of 9.848 . Suppose we randomly pick a state. 
(Source: Bureau of the Census, U.S. Dept. of Commerce) 


• a Define the random variable. X =

• b Is X continuous or discrete?

• c X ~

• d μ =

• e σ =

• f Draw a graph of the probability distribution. Label the axes.
• g Find the probability that the percent is less than 12.

• h Find the probability that the percent is between 8 and 14.

• i The percent of all individuals living in the United States who
speak a language at home other than English is 13.8.


  ◦ i Why is this number different from 9.848%?


  ◦ ii What would make this number higher than 9.848%?


Exercise: 


Problem: 


The time (in years) after reaching age 60 that it takes an individual to 
retire is approximately exponentially distributed with a mean of about 
5 years. Suppose we randomly pick one retired individual. We are 
interested in the time after age 60 to retirement. 


• a Define the random variable. X =

• b Is X continuous or discrete?

• c X ~

• d μ =

• e σ =

• f Draw a graph of the probability distribution. Label the axes.

• g Find the probability that the person retired after age 70.

• h Do more people retire before age 65 or after age 65?

• i In a room of 1000 people over age 80, how many do you expect
will NOT have retired yet?


Solution: 


• c X ~ Exp(1/5)
• d 5

• e 5

• g 0.1353

• h Before

• i 18.3


Exercise: 
Problem: 


The cost of all maintenance for a car during its first year is 
approximately exponentially distributed with a mean of $150. 


• a Define the random variable. X =

• b X ~

• c μ =

• d σ =

• e Draw a graph of the probability distribution. Label the axes.
• f Find the probability that a car required over $300 for
maintenance during its first year.


Try these multiple choice problems 


The next three questions refer to the following information. The average 
lifetime of a certain new cell phone is 3 years. The manufacturer will 
replace any cell phone failing within 2 years of the date of purchase. The 
lifetime of these cell phones is known to follow an exponential distribution. 
Exercise: 


Problem: The decay rate is 


• A 0.3333
• B 0.5000
• C 2.0000
• D 3.0000


Solution: 


A 
Exercise: 


Problem: 


What is the probability that a phone will fail within 2 years of the date 
of purchase? 


• A 0.8647
• B 0.4866
• C 0.2212
• D 0.9997


Solution: 


B 


Exercise: 


Problem: What is the median lifetime of these phones (in years)? 


• A 0.1941
• B 1.3863
• C 2.0794
• D 5.5452


Solution: 


C 


The next three questions refer to the following information. The Sky 
Train from the terminal to the rental car and long term parking center is 
supposed to arrive every 8 minutes. The waiting times for the train are 
known to follow a uniform distribution. 

Exercise: 


Problem: What is the average waiting time (in minutes)? 


• A 0.0000
• B 2.0000
• C 3.0000
• D 4.0000


Solution: 


D 


Exercise: 


Problem: Find the 30th percentile for the waiting times (in minutes). 


• A 2.0000
• B 2.4000
• C 2.7500
• D 3.0000


Solution: 


B 
Exercise: 


Problem: 


The probability of waiting more than 7 minutes given a person has 
waited more than 4 minutes is? 


• A 0.1250
• B 0.2500
• C 0.5000
• D 0.7500


Solution: 


B 
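
The three Sky Train answers follow from the uniform distribution formulas. The sketch below assumes the waiting time is X ~ U(0, 8) minutes, which is one reading of "supposed to arrive every 8 minutes"; the function name prob_greater is hypothetical.

```python
a, b = 0, 8  # assumed waiting time X ~ U(0, 8), in minutes

mean_wait = (a + b) / 2          # average waiting time
pct_30 = a + 0.30 * (b - a)      # 30th percentile: 30% of the way from a to b

def prob_greater(x):
    """P(X > x) for X ~ U(a, b), with a <= x <= b (hypothetical helper)."""
    return (b - x) / (b - a)

# Conditional probability: P(X > 7 | X > 4) = P(X > 7) / P(X > 4)
p_cond = prob_greater(7) / prob_greater(4)

print(mean_wait, pct_30, p_cond)
```

The conditional probability works because the event "more than 7 minutes" is contained in the event "more than 4 minutes", so the joint probability is simply P(X > 7).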


Review 
This module provides a number of homework/review problems related to 
Continuous Random Variables. 


[link] — [link] refer to the following study: A recent study of mothers of 
junior high school children in Santa Clara County reported that 76% of the 
mothers are employed in paid positions. Of those mothers who are 
employed, 64% work full-time (over 35 hours per week), and 36% work 
part-time. However, out of all of the mothers in the population, 49% work 
full-time. The population under study is made up of mothers of junior high 
school children in Santa Clara County. 


Let E = employed, Let F = full-time employment
Exercise: 


Problem: 
• a Find the percent of all mothers in the population that are NOT
employed.
• b Find the percent of mothers in the population that are employed
part-time.


Solution: 


• a 24%
• b 27%


Exercise: 
Problem: 
The type of employment is considered to be what type of data? 
Solution: 


Qualitative 


Exercise: 


Problem: 


Find the probability that a randomly selected mother works part-time 
given that she is employed. 


Solution: 


0.36 
Exercise: 


Problem: 


Find the probability that a randomly selected person from the 
population will be employed OR work full-time. 


Solution: 


0.7636 
Exercise: 


Problem: 


Based upon the above information, are being employed AND working 
part-time: 


• a mutually exclusive events? Why or why not?
• b independent events? Why or why not?
Solution: 


• a No,
• b No,


[link] - [link] refer to the following: We randomly pick 10 mothers from 
the above population. We are interested in the number of the mothers that 
are employed. Let X =number of mothers that are employed. 


Exercise: 


Problem: State the distribution for X.


Solution: 


B(10,0.76) 


Exercise: 


Problem: Find the probability that at least 6 are employed. 


Solution: 


0.9330 
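
As a check on this answer, the probability can be summed directly from the binomial formula. This is a minimal sketch assuming X ~ B(10, 0.76) as stated above; the helper name binom_pmf is hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p) (hypothetical helper)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.76

# "At least 6" means 6, 7, 8, 9, or 10 employed mothers.
p_at_least_6 = sum(binom_pmf(k, n, p) for k in range(6, n + 1))

print(round(p_at_least_6, 4))
```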
Exercise: 


Problem: 


We expect the Statistics Discussion Board to have, on average, 14 
questions posted to it per week. We are interested in the number of 
questions posted to it per day. 


• a Define X.

• b What are the values that the random variable may take on?

• c State the distribution for X.

• d Find the probability that from 10 to 14 (inclusive) questions are
posted to the Listserv on a randomly picked day.


Solution: 


• a X = the number of questions posted to the Statistics Listserv
per day

• b 0, 1, 2, 3, ...

• c X ~ P(2)

• d 0


Exercise: 
Problem: 


A person invests $1000 in stock of a company that hopes to go public 
in 1 year. 


• The probability that the person will lose all his money after 1 year
(i.e. his stock will be worthless) is 35%.

• The probability that the person’s stock will still have a value of
$1000 after 1 year (i.e. no profit and no loss) is 60%.

• The probability that the person’s stock will increase in value by
$10,000 after 1 year (i.e. will be worth $11,000) is 5%.


Find the expected PROFIT after 1 year. 


Solution: 


$150 

Exercise: 
Problem: 
Rachel’s piano cost $3000. The average cost for a piano is $4000 with 
a standard deviation of $2500. Becca’s guitar cost $550. The average 
cost for a guitar is $500 with a standard deviation of $200. Matt’s 
drums cost $600. The average cost for drums is $700 with a standard 


deviation of $100. Whose cost was lowest when compared to his or her 
own instrument? Justify your answer. 


Solution: 


Matt 
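
One way to justify this answer is to convert each cost to a z-score, (cost - mean)/standard deviation, and compare. The sketch below uses only the figures given in the problem; the variable names are illustrative.

```python
# (cost, average cost, standard deviation) for each person's instrument
costs = {
    "Rachel (piano)": (3000, 4000, 2500),
    "Becca (guitar)": (550, 500, 200),
    "Matt (drums)": (600, 700, 100),
}

# z-score: how many standard deviations each cost is from its own mean
z_scores = {name: (cost - mean) / sd for name, (cost, mean, sd) in costs.items()}

# The most negative z-score is the lowest cost relative to its own instrument.
lowest = min(z_scores, key=z_scores.get)

for name, z in z_scores.items():
    print(f"{name}: z = {z:.2f}")
print("Lowest relative cost:", lowest)
```

Matt's drums are 1 standard deviation below the average cost for drums, while Rachel's piano is only 0.4 standard deviations below its average and Becca's guitar is above its average.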
Exercise: 


Problem: 


For each statement below, explain why each is either true or false. 


e a 25% of the data are at most 5. 

e b There is the same amount of data from 4 —5 as there is from 5 — 
ds 

e c There are no data values of 3. 

e d 50% of the data are 4. 


Solution: 


• a False
• b True
• c False
• d False


[link] — [link] refer to the following: 64 faculty members were asked the 
number of cars they owned (including spouse and children’s cars). The 
results are given in the following graph: 

[Figure: graph of relative frequency (vertical axis) against number of cars, 0 through 7 (horizontal axis).]


Exercise: 


Problem: Find the approximate number of responses that were “3.” 


Solution: 


16 
Exercise: 


Problem: 


Find the first, second and third quartiles. Use them to construct a box 
plot of the data. 


Solution: 


54 


[link] — [link] refer to the following study done of the Girls soccer team 
“Snow Leopards”: 


Hair Style    Hair Color

              blond    brown    black
ponytail      3        2        5
plain         2        2        1


Suppose that one girl from the Snow Leopards is randomly selected. 
Exercise: 


Problem: 


Find the probability that the girl has black hair GIVEN that she wears 
a ponytail. 


Solution: 


Exercise: 


Problem: 


Find the probability that the girl wears her hair plain OR has brown 
hair. 
Solution: 


7/15


Exercise: 


Problem: 


Find the probability that the girl has blond hair AND that she wears 
her hair plain. 


Solution: 


2/15


Continuous Random Variables: Lab I 

In this lab exercise, students will compare and contrast empirical data using 
Minitab with the Uniform Distribution. Note: This module is based on a 
student being able to access the Minitab statistical program. This modu 
Continuous Distribution Lab 


Name: 


I - Student Learning Outcomes: 


• The student will compare and contrast empirical data from a random
number generator with the Uniform Distribution.


II - Theoretical Distribution 


The theoretical distribution of X is X ~ U(0, 1). Use it for this part. In
theory,


μ = ____ σ = ____ 1st quartile = ____


40th percentile = ____ 3rd quartile = ____ Median = ____


III - Collect the Data

Use Minitab to generate 100 values between 0 and 1 (inclusive). (Calc
→ Random Data → Uniform) Using Minitab, calculate the following
(include the session window):


x̄ = ____ s = ____ 1st quartile = ____


40th percentile = ____ (justify) 3rd quartile = ____ median = ____


IV - Comparing the Data 


1. For each part below, use a complete sentence to comment on how the 
value obtained from the experimental data (see part III) compares to the 
theoretical value you expected from the distribution in section II. (How it is 
reflected in the corresponding data. Be specific!) 


a. minimum value: 

b. first quartile: 

c. median: 

d. third quartile 

e. maximum value: 

f. width of IQR: 

V - Plotting the Data and Interpreting the Graphs. 


1. What does the probability graph for the theoretical distribution look like? 
Draw it here and label the axis. 


2. Use Minitab to construct a histogram using 5 bars and density as the
y-axis value. Be sure to attach the graphs to this lab.


a. Describe the shape of the histogram. Use 2 - 3 complete sentences. (Keep 
it simple. Does the graph go straight across, does it have a V shape, does it 
have a hump in the middle or at either end, etc.? One way to help you 
determine the shape is to roughly draw a smooth curve through the top of 
the bars.) 


b. How does this histogram compare to the graph of the theoretical uniform 
distribution? Draw the horizontal line which represents the theoretical 
distribution on the histogram for ease of comparison. Be sure to use 2 — 3 
complete sentences. 


3. Draw the box plot for the theoretical distribution and label the axis. 


4. Construct a box plot of the experimental data using Minitab and attach 
the graph. 


a. Do you notice any potential outliers? 
If so, which values are they? 
b. Numerically justify your answer using the appropriate formulas. 


c. How does this plot compare to the box plot of the theoretical uniform 
distribution? Be sure to use 2 — 3 complete sentences. 


VI - Increasing the sample size. Repeat the simulation with 500 
data values. 


1. Using Minitab, calculate the following (include the session window): 
x̄ = ____ s = ____ 1st quartile = ____

40th percentile = ____ (justify) 3rd quartile = ____ median = ____
2. Does this data appear to reflect the theoretical data more closely than the 
original? Be sure to use 2 — 3 complete sentences. (Be specific.) 


3. Create a histogram with 5 bars and using density for the y-axis and box 
plot for this data. (attach to this lab) 


4. How do these compare to the theoretical distribution? Be sure to use 2 — 
3 complete sentences. (Be specific.) 


The Normal Distribution 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Recognize the normal probability distribution and apply it
appropriately.

• Recognize the standard normal probability distribution and apply it
appropriately.

• Compare normal probabilities by converting to the standard normal
distribution.


Introduction 


The normal, a continuous distribution, is the most important of all the 
distributions. It is widely used and even more widely abused. Its graph is 
bell-shaped. You see the bell curve in almost all disciplines. Some of these 
include psychology, business, economics, the sciences, nursing, and, of 
course, mathematics. Some of your instructors may use the normal 
distribution to help determine your grade. Most IQ scores are normally 
distributed. Often real estate prices fit a normal distribution. The normal 
distribution is extremely important but it cannot be applied to everything in 
the real world. 


In this chapter, you will study the normal distribution, the standard normal, 
and applications associated with them. 


Optional Collaborative Classroom Activity 


Your instructor will record the heights of both men and women in your 
class, separately. Draw histograms of your data. Then draw a smooth curve 
through each histogram. Is each curve somewhat bell-shaped? Do you think 
that if you had recorded 200 data values for men and 200 for women that 
the curves would look bell-shaped? Calculate the mean for each data set. 
Write the means on the x-axis of the appropriate graph below the peak. 


Shade the approximate area that represents the probability that one 
randomly chosen male is taller than 72 inches. Shade the approximate area 
that represents the probability that one randomly chosen female is shorter 
than 60 inches. If the total area under each curve is one, does either 
probability appear to be more than 0.5? 


The normal distribution has two parameters (two numerical descriptive
measures), the mean (μ) and the standard deviation (σ). If X is a quantity to
be measured that has a normal distribution with mean (μ) and the standard
deviation (σ), we designate this by writing


NORMAL: X ~ N(μ, σ)


The probability density function is a rather complicated function. Do not 
memorize it. It is not necessary. 


$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$


The cumulative distribution function is P(X < x). It is calculated either
by a calculator or a computer or it is looked up in a table. Technology has 
made the tables basically obsolete. For that reason, as well as the fact that 
there are various table formats, we are not including table instructions in 
this chapter. See the NOTE in this chapter in Calculation of Probabilities. 


The curve is symmetrical about a vertical line drawn through the mean, μ.
In theory, the mean is the same as the median since the graph is symmetric
about μ. As the notation indicates, the normal distribution depends only on
the mean and the standard deviation. Since the area under the curve must
equal one, a change in the standard deviation, σ, causes a change in the
shape of the curve; the curve becomes fatter or skinnier depending on σ. A
change in μ causes the graph to shift to the left or right. This means there
are an infinite number of normal probability distributions. One of special
interest is called the standard normal distribution.
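
The effect of each parameter on the curve can be seen by evaluating the density directly. The following is a minimal sketch of the probability density function written out in Python; the function name normal_pdf and the sample parameter values are illustrative assumptions, not part of the text.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal density curve at x (hypothetical helper)."""
    coeff = 1 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Same mean, different standard deviations: a smaller sigma gives a taller,
# skinnier curve, and a larger sigma gives a shorter, fatter curve.
for sigma in (0.5, 1, 2):
    print(f"sigma = {sigma}: peak height = {normal_pdf(0, 0, sigma):.4f}")

# Changing the mean shifts the curve left or right without changing its shape:
# the height at the mean is the same for mu = 0 and mu = 3.
print(normal_pdf(0, 0, 1), normal_pdf(3, 3, 1))
```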


Glossary 


Normal Distribution 


A continuous random variable (RV) with pdf
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$, where μ is the mean of the distribution and
σ is the standard deviation. Notation: X ~ N(μ, σ). If μ = 0 and
σ = 1, the RV is called the standard normal distribution.


The Standard Normal Distribution 


The standard normal distribution is a normal distribution of 
standardized values called z-scores. A z-score is measured in units of 
the standard deviation. For example, if the mean of a normal distribution 
is 5 and the standard deviation is 2, the value 11 is 3 standard deviations 
above (or to the right of) the mean. The calculation is: 

Equation:

x = μ + (z)σ = 5 + (3)(2) = 11

The z-score is 3. 


The mean for the standard normal distribution is 0 and the standard 
deviation is 1. The transformation 


z = (x - μ)/σ produces the distribution Z ~ N(0, 1). The value x comes
from a normal distribution with mean μ and standard deviation σ.


Glossary 


Standard Normal Distribution 
A continuous random variable (RV) X ~ N(0, 1). When X follows the
standard normal distribution, it is often noted as Z ~ N(0, 1).


z-score 
The linear transformation of the form z = (x - μ)/σ. If this transformation
is applied to any normal distribution X ~ N(μ, σ), the result is the
standard normal distribution Z ~ N(0, 1). If this transformation is
applied to any specific value x of the RV with mean μ and standard
deviation σ, the result is called the z-score of x. Z-scores allow us to
compare data that are normally distributed but scaled differently.


Z-Scores 


If X is a normally distributed random variable and X ~ N(μ, σ), then the
z-score is:
Equation:

z = (x - μ)/σ

The z-score tells you how many standard deviations the value x is
above (to the right of) or below (to the left of) the mean, μ. Values of x
that are larger than the mean have positive z-scores and values of x that are
smaller than the mean have negative z-scores. If x equals the mean, then x
has a z-score of 0.


Example: 

Suppose X ~ N(5, 6). This says that X is a normally distributed random
variable with mean μ = 5 and standard deviation σ = 6. Suppose x = 17.
Then:

Equation:

z = (x - μ)/σ = (17 - 5)/6 = 2


This means that x = 17 is 2 standard deviations (2σ) above or to the
right of the mean μ = 5. The standard deviation is σ = 6.

Notice that:

Equation:

5 + (2)(6) = 17 (The pattern is μ + zσ = x.)


Now suppose x = 1. Then:
Equation:

z = (x - μ)/σ = (1 - 5)/6 = -0.67 (rounded to two decimal places)


This means that x = 1 is 0.67 standard deviations (-0.67σ) below or to
the left of the mean μ = 5. Notice that:

5 + (-0.67)(6) is approximately equal to 1 (This has the pattern
μ + (-0.67)σ = 1)

Summarizing, when z is positive, x is above or to the right of μ and when
z is negative, x is to the left of or below μ.
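
The two calculations in this example, and the reverse pattern μ + zσ = x, can be sketched in a few lines of Python. The function name z_score is hypothetical; the numbers are those of the example, X ~ N(5, 6).

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu (hypothetical helper)."""
    return (x - mu) / sigma

mu, sigma = 5, 6  # X ~ N(5, 6)

z_17 = z_score(17, mu, sigma)   # x above the mean gives a positive z-score
z_1 = z_score(1, mu, sigma)     # x below the mean gives a negative z-score

# Reversing the transformation recovers the original value: x = mu + z * sigma
x_back = mu + z_17 * sigma

print(z_17, round(z_1, 2), x_back)
```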


Example: 

Some doctors believe that a person can lose 5 pounds, on the average, in a 
month by reducing his/her fat intake and by exercising consistently. 
Suppose weight loss has a normal distribution. Let X = the amount of 
weight lost (in pounds) by a person in a month. Use a standard deviation of 
2 pounds. X~N(5, 2). Fill in the blanks. 

Exercise: 


Problem: 


Suppose a person lost 10 pounds in a month. The z-score when
x = 10 pounds is z = 2.5 (verify). This z-score tells you that x = 10
is ____ standard deviations to the ____ (right or left) of the
mean ____ (What is the mean?).
Solution: 


This z-score tells you that x = 10 is 2.5 standard deviations to the 
right of the mean 5. 


Exercise: 
Problem: 
Suppose a person gained 3 pounds (a negative weight loss). Then z =
____. This z-score tells you that x = -3 is ____ standard
deviations to the ____ (right or left) of the mean.


Solution: 


z = -4. This z-score tells you that x = -3 is 4 standard deviations to
the left of the mean.
Suppose the random variables X and Y have the following normal
distributions: X ~ N(5, 6) and Y ~ N(2, 1). If x = 17, then z = 2. (This
was previously shown.) If y = 4, what is z?
Equation:

z = (y - μ)/σ = (4 - 2)/1 = 2, where μ = 2 and σ = 1.


The z-score for y = 4 is z = 2. This means that 4 is z = 2 standard 
deviations to the right of the mean. Therefore, x = 17 and y = 4 are both 2 
(of their) standard deviations to the right of their respective means. 

The z-score allows us to compare data that are scaled differently. To 
understand the concept, suppose X ~N(5, 6) represents weight gains for 
one group of people who are trying to gain weight in a 6 week period and 
Y ~N(2, 1) measures the same weight gain for a second group of people. 
A negative weight gain would be a weight loss. Since x = 17 and y = 4 
are each 2 standard deviations to the right of their means, they represent 
the same weight gain relative to their means. 


The Empirical Rule 
If X is a random variable and has a normal distribution with mean μ and
standard deviation σ then the Empirical Rule says (See the figure below)


• About 68.27% of the x values lie between -1σ and +1σ of the mean μ
(within 1 standard deviation of the mean).

• About 95.45% of the x values lie between -2σ and +2σ of the mean μ
(within 2 standard deviations of the mean).

• About 99.73% of the x values lie between -3σ and +3σ of the mean μ
(within 3 standard deviations of the mean). Notice that almost all the x
values lie within 3 standard deviations of the mean.

• The z-scores for +1σ and -1σ are +1 and -1, respectively.

• The z-scores for +2σ and -2σ are +2 and -2, respectively.

• The z-scores for +3σ and -3σ are +3 and -3, respectively.


[Figure: normal curve with the x-axis marked at μ - 3σ, μ - 2σ, μ - 1σ, μ, μ + 1σ, μ + 2σ, and μ + 3σ.]


The Empirical Rule is also known as the 68-95-99.7 Rule. 


Example: 


Suppose X has a normal distribution with mean 50 and standard deviation 
6. 


• About 68.27% of the x values lie between -1σ = (-1)(6) = -6 and 1σ =
(1)(6) = 6 of the mean 50. The values 50 - 6 = 44 and 50 + 6 = 56 are
within 1 standard deviation of the mean 50. The z-scores are -1 and
+1 for 44 and 56, respectively.

• About 95.45% of the x values lie between -2σ = (-2)(6) = -12 and 2σ
= (2)(6) = 12 of the mean 50. The values 50 - 12 = 38 and 50 + 12 =
62 are within 2 standard deviations of the mean 50. The z-scores are
-2 and +2 for 38 and 62, respectively.

• About 99.73% of the x values lie between -3σ = (-3)(6) = -18 and 3σ
= (3)(6) = 18 of the mean 50. The values 50 - 18 = 32 and 50 + 18 =
68 are within 3 standard deviations of the mean 50. The z-scores are
-3 and +3 for 32 and 68, respectively.
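
The three percentages in the Empirical Rule can be reproduced with the NormalDist class from Python's standard statistics module. This is a sketch for the example above (mean 50, standard deviation 6); the variable names are illustrative.

```python
from statistics import NormalDist

dist = NormalDist(mu=50, sigma=6)  # the example above: mean 50, standard deviation 6

results = []
for k in (1, 2, 3):
    low = dist.mean - k * dist.stdev    # k standard deviations below the mean
    high = dist.mean + k * dist.stdev   # k standard deviations above the mean
    share = dist.cdf(high) - dist.cdf(low)  # area under the curve between them
    results.append((k, low, high, share))
    print(f"within {k} sd: {low:g} to {high:g}, share = {share:.4f}")
```

The shares do not depend on the particular mean and standard deviation chosen; any normal distribution gives the same three values.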


Normal Distribution: Areas to the Left and Right of x 


The arrow in the graph below points to the area to the left of x. This area is
represented by the probability P(X < x). Normal tables, computers, and
calculators provide or calculate the probability P(X < x).


[Figure: normal curve with the area to the left of x shaded, labeled P(X < x).]


The area to the right is then P(X > x) = 1 - P(X < x).
Remember, P(X < x) = Area to the left of the vertical line through x.


P(X > x) = 1 - P(X < x) = Area to the right of the vertical line
through x.


P(X < x) is the same as P(X ≤ x) and P(X > x) is the same as
P(X ≥ x) for continuous distributions.


Calculations of Probabilities 


Probabilities are calculated by using technology. There are instructions in 
the chapter for the TI-83+ and TI-84 calculators. 


Note: In the Table of Contents for Collaborative Statistics, entry 15.
Tables has a link to a table of normal probabilities. Use the probability
tables if so desired, instead of a calculator. The tables include instructions
for how to use them.


Example: 
If the area to the left is 0.0228, then the area to the right is 
1 — 0.0228 = 0.9772. 


Example: 

The final exam scores in a statistics class were normally distributed with a 
mean of 63 and a standard deviation of 5. 

Exercise: 


Problem: 


Find the probability that a randomly selected student scored more than 
65 on the exam. 


Solution: 


Let X = a score on the final exam. X ~ N(63, 5), where μ = 63 and
σ = 5.


Draw a graph. 


Then, find P(x > 65).


P(x > 65) = 0.3446 (calculator or computer)


[Figure: normal curve centered at 63 with the area to the right of 65 shaded, labeled 0.3446.]


The probability that one student scores more than 65 is 0.3446.


Using the TI-83+ or the TI-84 calculators, the calculation is as 
follows. Go into 2nd DISTR. 


After pressing 2nd DISTR, press 2:normalcdf. 
The syntax for the instructions is shown below.


normalcdf(lower value, upper value, mean, standard deviation) For
this problem: normalcdf(65,1E99,63,5) = 0.3446. You get 1E99 ( =
10^99) by pressing 1, the EE key (a 2nd key) and then 99. Or, you can
enter 10^99 instead. The number 10^99 is way out in the right tail of
the normal curve. We are calculating the area between 65 and 10^99. In
some instances, the lower number of the area might be -1E99 ( =
-10^99). The number -10^99 is way out in the left tail of the normal
curve.
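
For readers working on a computer rather than a TI calculator, the same area can be found with Python's standard statistics module. The function below is a hypothetical stand-in that mimics the argument order of the calculator's normalcdf; it is a sketch, not the calculator's own method.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean, sd):
    """Area under the N(mean, sd) curve between lower and upper.

    Hypothetical helper that copies the argument order of the TI function.
    """
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)

# Same inputs as the calculator example: 1e99 stands in for "far into the right tail".
p_more_than_65 = normalcdf(65, 1e99, 63, 5)
print(round(p_more_than_65, 4))

# A left-tail area uses -1e99 as the lower value.
p_less_than_65 = normalcdf(-1e99, 65, 63, 5)
print(round(p_less_than_65, 4))
```

The two areas add to 1, which matches the rule that the area to the right equals 1 minus the area to the left.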


Note: The TI probability program calculates a z-score and then the
probability from the z-score. Before technology, the z-score was
looked up in a standard normal probability table (because the math
involved is too cumbersome) to find the probability. In this example,
a standard normal table with area to the left of the z-score was used.
You calculate the z-score and look up the area to the left. The
probability is the area to the right.


z = (65 - 63)/5 = 0.4. Area to the left is 0.6554.
P(x > 65) = P(z > 0.4) = 1 - 0.6554 = 0.3446


Exercise: 


Problem: 


Find the probability that a randomly selected student scored less than 
85. 


Solution: 


Draw a graph. 


Then find P(x < 85). Shade the graph. P(x < 85) = 1 (calculator
or computer)


The probability that one student scores less than 85 is approximately 1 
(or 100%). 


The TI-instructions and answer are as follows: 


normalcdf(0,85,63,5) = 1 (rounds to 1) 
Exercise: 


Problem: 


Find the 90th percentile (that is, find the score k that has 90 % of the 
scores below k and 10% of the scores above k). 


Solution: 


Find the 90th percentile. For each problem or part of a problem, draw 
a new graph. Draw the x-axis. Shade the area that corresponds to the 
90th percentile. 


Let k = the 90th percentile. k is located on the x-axis. P(x < k) is
the area to the left of k. The 90th percentile k separates the exam
scores into those that are the same or lower than k and those that are
the same or higher. Ninety percent of the test scores are the same or
lower than k and 10% are the same or higher. k is often called a
critical value.


k = 69.4 (calculator or computer) 


P(x < k) = 0.90 


The 90th percentile is 69.4. This means that 90% of the test scores fall
at or below 69.4 and 10% fall at or above. For the TI-83+ or TI-84
calculators, use invNorm in 2nd DISTR. invNorm(area to the left,
mean, standard deviation) For this problem, invNorm(0.90,63,5) =
69.4
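
A percentile can also be found on a computer. The sketch below uses the inv_cdf method of NormalDist from Python's standard statistics module; the wrapper name inv_norm is hypothetical and only copies the calculator's argument order.

```python
from statistics import NormalDist

def inv_norm(area_to_left, mean, sd):
    """Value k with P(X < k) = area_to_left for X ~ N(mean, sd) (hypothetical helper)."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(area_to_left)

k_90 = inv_norm(0.90, 63, 5)
print(round(k_90, 1))

# Check: the area to the left of k should come back as 0.90.
area = NormalDist(mu=63, sigma=5).cdf(k_90)
print(round(area, 2))
```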


Exercise: 


Problem: 


Find the 70th percentile (that is, find the score k such that 70% of 
scores are below k and 30% of the scores are above k). 


Solution: 


Find the 70th percentile. 
Draw a new graph and label it appropriately. k = 65.6 


The 70th percentile is 65.6. This means that 70% of the test scores fall
at or below 65.6 and 30% fall at or above.


invNorm(0.70,63,5) = 65.6 


Example: 

A computer is used for office work at home, research, communication, 
personal finances, education, entertainment, social networking and a 
myriad of other things. Suppose that the average number of hours a 
household personal computer is used for entertainment is 2 hours per day. 
Assume the times for entertainment are normally distributed and the 
standard deviation for the times is half an hour. 

Exercise: 


Problem: 


Find the probability that a household personal computer is used 
between 1.8 and 2.75 hours per day. 


Solution: 


Let X = the amount of time (in hours) a household personal computer
is used for entertainment. X ~ N(2, 0.5) where μ = 2 and σ = 0.5.


Find P(1.8 < x < 2.75).


The probability for which you are looking is the area between
x = 1.8 and x = 2.75. P(1.8 < x < 2.75) = 0.5886


[Figure: normal curve with the area between x = 1.8 and x = 2.75 shaded; the mean 2 is marked on the x-axis.]


normalcdf(1.8,2.75,2,0.5) = 0.5886 
The probability that a household personal computer is used between 
1.8 and 2.75 hours per day for entertainment is 0.5886. 
Exercise: 
Problem: 


Find the maximum number of hours per day that the bottom quartile 
of households use a personal computer for entertainment. 


Solution: 


To find the maximum number of hours per day that the bottom 
quartile of households uses a personal computer for entertainment, 
find the 25th percentile, k, where P(x < k) = 0.25. 


k = 1.66


[Figure: normal curve with the area to the left of k shaded. P(x < k) = 0.25 and P(x > k) = 0.75.]


invNorm(0.25,2,.5) = 1.66 


The maximum number of hours per day that the bottom quartile of 
households uses a personal computer for entertainment is 1.66 hours. 


Summary of Formulas 
Formula 
Normal Probability Distribution 


X ~ N(μ, σ)


μ = the mean σ = the standard deviation
Formula
Standard Normal Probability Distribution


Z ~ N(0, 1)
z = a standardized value (z-score)


mean = 0 standard deviation = 1
Formula
Finding the kth Percentile


To find the kth percentile when the z-score is known: k = μ + (z)σ
Formula
z-score
z = (x - μ)/σ
Formula


Finding the area to the left


The area to the left: P(X < x)
Formula
Finding the area to the right


The area to the right: P(X > x) = 1 - P(X < x)


Practice: The Normal Distribution 


Student Learning Outcomes 


e The student will analyze data following a normal distribution. 


Given 


The life of Sunshine CD players is normally distributed with a mean of 4.1 
years and a standard deviation of 1.3 years. A CD player is guaranteed for 3 
years. We are interested in the length of time a CD player lasts. 


Normal Distribution 
Exercise: 


Problem: Define the Random Variable X in words. X = 


Exercise: 


Problem: X~ 
Exercise: 


Problem: 


Find the probability that a CD player will break down during the 
guarantee period. 


• a Sketch the situation. Label and scale the axes. Shade the region
corresponding to the probability.


• b P(0 < x < ____) = ____ (Use zero (0) for the
minimum value of x.)


Solution:
• b 3, 0.1979
Exercise: 
Problem: 
Find the probability that a CD player will last between 2.8 and 6 years. 


• a Sketch the situation. Label and scale the axes. Shade the region
corresponding to the probability.


• b P(____ < x < ____) = ____


Solution:
• b 2.8, 6, 0.7694
Exercise: 
Problem: 


Find the 70th percentile of the distribution for the time a CD player 
lasts. 


• a Sketch the situation. Label and scale the axes. Shade the region
corresponding to the lower 70%.


• b P(x < k) = ____. Therefore, k = ____


Solution:


• b 0.70, 4.78 years


Homework 
Exercise: 


Problem: 


According to a study done by De Anza students, the height for Asian 
adult males is normally distributed with an average of 66 inches and a 
standard deviation of 2.5 inches. Suppose one Asian adult male is 
randomly chosen. Let X =height of the individual. 


• a X ~ ____(____, ____)

• b Find the probability that the person is between 65 and 69
inches. Include a sketch of the graph and write a probability
statement.

• c Would you expect to meet many Asian adult males over 72
inches? Explain why or why not, and justify your answer
numerically.

• d The middle 40% of heights fall between what two values?
Sketch the graph and write the probability statement.


Solution: 


• a N(66, 2.5)

• b 0.5404

• c No

• d Between 64.7 and 67.3 inches


Exercise: 


Problem: 


IQ is normally distributed with a mean of 100 and a standard deviation 
of 15. Suppose one individual is randomly chosen. Let X =IQ of an 
individual. 


• a X ~ ____(____, ____)
• b Find the probability that the person has an IQ greater than 120.
Include a sketch of the graph and write a probability statement.


• c Mensa is an organization whose members have the top 2% of all
IQs. Find the minimum IQ needed to qualify for the Mensa
organization. Sketch the graph and write the probability
statement.

• d The middle 50% of IQs fall between what two values? Sketch
the graph and write the probability statement.


Exercise: 


Problem: 


The percent of fat calories that a person in America consumes each 
day is normally distributed with a mean of about 36 and a standard 
deviation of 10. Suppose that one individual is randomly chosen. Let 
X =percent of fat calories. 


• a X ~ ____(____, ____)

• b Find the probability that the percent of fat calories a person
consumes is more than 40. Graph the situation. Shade in the area
to be determined.

• c Find the maximum number for the lower quarter of percent of
fat calories. Sketch the graph and write the probability statement.


Solution: 


• a N(36, 10)
• b 0.3446
• c 29.3


Exercise: 
Problem: 
Suppose that the distance of fly balls hit to the outfield (in baseball) is 


normally distributed with a mean of 250 feet and a standard deviation 
of 50 feet. 


• a If X = distance in feet for a fly ball, then X ~
____(____, ____)

• b If one fly ball is randomly chosen from this distribution, what is
the probability that this ball traveled fewer than 220 feet? Sketch
the graph. Scale the horizontal axis X. Shade the region
corresponding to the probability. Find the probability.

• c Find the 80th percentile of the distribution of fly balls. Sketch
the graph and write the probability statement.


Exercise: 


Problem: 


In China, 4-year-olds average 3 hours a day unsupervised. Most of the 
unsupervised children live in rural areas, considered safe. Suppose that 
the standard deviation is 1.5 hours and the amount of time spent alone 
is normally distributed. We randomly survey one Chinese 4-year-old 
living in a rural area. We are interested in the amount of time the child 
spends alone per day. (Source: San Jose Mercury News) 


• a In words, define the random variable X. X = _____

• b X ~ _____

• c Find the probability that the child spends less than 1 hour per
day unsupervised. Sketch the graph and write the probability
statement.

• d What percent of the children spend over 10 hours per day
unsupervised?

• e 70% of the children spend at least how long per day
unsupervised?


Solution: 


• a The time (in hours) a 4-year-old in China spends unsupervised
per day

• b N(3, 1.5)

• c 0.0912

• d 0

• e 2.21 hours


Exercise: 


Problem: 


In the 1992 presidential election, Alaska’s 40 election districts 
averaged 1956.8 votes per district for President Clinton. The standard 
deviation was 572.3. (There are only 40 election districts in Alaska.) 
The distribution of the votes per district for President Clinton was bell- 
shaped. Let X = number of votes for President Clinton for an election 
district. (Source: The World Almanac and Book of Facts) 


a State the approximate distribution of X. X~ 

b Is 1956.8 a population mean or a sample mean? How do you 
know? 

c Find the probability that a randomly selected district had fewer 
than 1600 votes for President Clinton. Sketch the graph and write 
the probability statement. 

d Find the probability that a randomly selected district had 
between 1800 and 2000 votes for President Clinton. 

e Find the third quartile for votes for President Clinton. 


Exercise: 


Problem: 


Suppose that the duration of a particular type of criminal trial is known 
to be normally distributed with a mean of 21 days and a standard 
deviation of 7 days. 


a In words, define the random variable X. X = 

b X~ 

c If one of the trials is randomly chosen, find the probability that 
it lasted at least 24 days. Sketch the graph and write the 
probability statement. 

d 60% of all of these types of trials are completed within how 
many days? 


Solution: 


• a The duration of a criminal trial
• b N(21, 7)

• c 0.3341

• d 22.77


Exercise: 


Problem: 


Terri Vogel, an amateur motorcycle racer, averages 129.71 seconds per
2.5 mile lap (in a 7 lap race) with a standard deviation of 2.28 seconds.
Her lap times are normally distributed. We are
interested in one of her randomly selected laps. (Source: log book of
Terri Vogel)


• a In words, define the random variable X. X = _____
• b X ~ _____
• c Find the percent of her laps that are completed in less than 130
seconds.
• d The fastest 3% of her laps are under _____.
• e The middle 80% of her laps are from _____ seconds to _____
seconds.


Exercise: 


Problem: 


Thuy Dau, Ngoc Bui, Sam Su, and Lan Voung conducted a survey as 
to how long customers at Lucky claimed to wait in the checkout line 
until their turn. Let X = time in line. Below are the ordered real data
(in minutes):


0.50   4.25   5      6      7.25
1.75   4.25   5.25   6      7.25
2      4.25   5.25   6.25   7.25
2.25   4.25   5.5    6.25   7.75
2.25   4.5    5.5    6.5    8
2.5    4.75   5.5    6.5    8.25
2.75   4.75   5.75   6.5    9.5
3.25   4.75   5.75   6.75   9.5
3.75   5      6      6.75   9.75
3.75   5      6      6.75   10.75


a Calculate the sample mean and the sample standard deviation. 
b Construct a histogram. Start the x-axis at -0.375 and make
bar widths of 2 minutes.

c Draw a smooth curve through the midpoints of the tops of the 
bars. 

d In words, describe the shape of your histogram and smooth 
curve. 

e Let the sample mean approximate $\mu$ and the sample standard
deviation approximate $\sigma$. The distribution of X can then be
approximated by X ~ _____

f Use the distribution in (e) to calculate the probability that a 
person will wait fewer than 6.1 minutes. 

g Determine the cumulative relative frequency for waiting less 
than 6.1 minutes. 

h Why aren’t the answers to (f) and (g) exactly the same? 

i Why are the answers to (f) and (g) as close as they are? 


j If only 10 customers were surveyed instead of 50, do you think
the answers to (f) and (g) would have been closer together or 
farther apart? Explain your conclusion. 


Solution: 


• a The sample mean is 5.51 and the sample standard deviation is
2.15

• e N(5.51, 2.15)

• f 0.6081

• g 0.64
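
The comparison in parts (f) and (g) can also be checked numerically. The following is a minimal sketch, assuming the 50 ordered waiting times are the ones listed in the problem; it uses Python's standard `statistics` module rather than a calculator, and the variable names are only illustrative.

```python
from statistics import NormalDist, mean, stdev

# Waiting times in minutes for the 50 surveyed customers
# (assumed to match the ordered data given in the problem).
times = [
    0.50, 1.75, 2, 2.25, 2.25, 2.5, 2.75, 3.25, 3.75, 3.75,
    4.25, 4.25, 4.25, 4.25, 4.5, 4.75, 4.75, 4.75, 5, 5,
    5, 5.25, 5.25, 5.5, 5.5, 5.5, 5.75, 5.75, 6, 6,
    6, 6, 6.25, 6.25, 6.5, 6.5, 6.5, 6.75, 6.75, 6.75,
    7.25, 7.25, 7.25, 7.75, 8, 8.25, 9.5, 9.5, 9.75, 10.75,
]

x_bar = mean(times)   # sample mean, part (a)
s = stdev(times)      # sample standard deviation, part (a)

# Part (f): probability from the fitted normal model N(x_bar, s).
model_prob = NormalDist(x_bar, s).cdf(6.1)

# Part (g): cumulative relative frequency taken directly from the data.
empirical_prob = sum(t < 6.1 for t in times) / len(times)

print(f"mean = {x_bar:.2f}, stdev = {s:.2f}")
print(f"normal model P(X < 6.1) = {model_prob:.4f}")
print(f"relative frequency      = {empirical_prob:.2f}")
```

The two results are close but not identical because the normal curve is only a smooth model of the 50 observed values.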


Exercise: 


Problem: 


Suppose that Ricardo and Anita attend different colleges. Ricardo’s 
GPA is the same as the average GPA at his school. Anita’s GPA is 0.70 
standard deviations above her school average. In complete sentences, 
explain why each of the following statements may be false. 


• a Ricardo’s actual GPA is lower than Anita’s actual GPA.
• b Ricardo is not passing since his z-score is zero.
• c Anita is in the 70th percentile of students at her college.


Exercise: 


Problem: 


Below is a sample of the maximum capacity (maximum number of 
spectators) of sports stadiums. The table does not include horse racing 
or motor racing stadiums. (Source: 
http://en.wikipedia.org/wiki/List_of_stadiums_by_capacity) 


40,000   40,000   45,050   45,500   46,249   48,134
49,133   50,071   50,096   50,466   50,832   51,100
51,500   51,900   52,000   52,132   52,200   52,530
52,692   53,864   54,000   55,000   55,000   55,000
55,000   55,000   55,000   55,082   57,000   58,008
59,680   60,000   60,000   60,492   60,580   62,380
62,872   64,035   65,000   65,050   65,647   66,000
66,161   67,428   68,349   68,976   69,372   70,107
70,585   71,594   72,000   72,922   73,379   74,500
75,025   76,212   78,000   80,000   80,000   82,300


• a Calculate the sample mean and the sample standard deviation
for the maximum capacity of sports stadiums (the data).
• b Construct a histogram of the data.
• c Draw a smooth curve through the midpoints of the tops of the
bars of the histogram.
• d In words, describe the shape of your histogram and smooth
curve.
• e Let the sample mean approximate $\mu$ and the sample standard
deviation approximate $\sigma$. The distribution of X can then be
approximated by X ~ _____
• f Use the distribution in (e) to calculate the probability that the
maximum capacity of sports stadiums is less than 67,000
spectators.
• g Determine the cumulative relative frequency that the maximum
capacity of sports stadiums is less than 67,000 spectators. Hint:
Order the data and count the sports stadiums that have a
maximum capacity less than 67,000. Divide by the total number
of sports stadiums in the sample.
• h Why aren’t the answers to (f) and (g) exactly the same?


Solution: 


• a The sample mean is 60,136.4 and the sample standard deviation
is 10,468.1.

• e N(60136.4, 10468.1)

• f 0.7440

• g 0.7167


Try These Multiple Choice Questions 


The questions below refer to the following: The patient recovery time 
from a particular surgical procedure is normally distributed with a mean of 
5.3 days and a standard deviation of 2.1 days. 

Exercise: 


Problem: What is the median recovery time? 


• A 2.7
• B 5.3
• C 7.4
• D 2.1


Solution: 


B 
Exercise: 


Problem: 


What is the z-score for a patient who takes 10 days to recover? 


• A 1.5
• B 0.2
• C 2.2
• D 7.3


Solution: 


C 
Exercise: 


Problem: 
What is the probability of spending more than 2 days in recovery? 


• A 0.0580
• B 0.8447
• C 0.0553
• D 0.9420


Solution: 


D 


Exercise: 


Problem: The 90th percentile for recovery times is? 


• A 8.89
• B 7.07
• C 7.99
• D 4.32


Solution: 


C
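
The four recovery-time answers above can be verified with a short script. This is only a sketch using Python's standard `statistics.NormalDist` in place of the calculator functions; the variable names are illustrative.

```python
from statistics import NormalDist

# Recovery time in days: X ~ N(5.3, 2.1), as stated in the questions above.
recovery = NormalDist(mu=5.3, sigma=2.1)

# For a normal distribution the median equals the mean.
median_days = recovery.median

# z-score for a patient who takes 10 days to recover.
z_ten_days = (10 - recovery.mean) / recovery.stdev

# P(X > 2): area to the right of 2 days.
p_more_than_2 = 1 - recovery.cdf(2)

# 90th percentile: the value with area 0.90 to its left.
percentile_90 = recovery.inv_cdf(0.90)

print(median_days, round(z_ten_days, 1),
      round(p_more_than_2, 4), round(percentile_90, 2))
```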


The questions below refer to the following: The length of time to find a 
parking space at 9 A.M. follows a normal distribution with a mean of 5 
minutes and a standard deviation of 2 minutes. 

Exercise: 


Problem: 


Based upon the above information and numerically justified, would 
you be surprised if it took less than 1 minute to find a parking space? 


• A Yes
• B No
• C Unable to determine


Solution: 


A 
Exercise: 


Problem: 


Find the probability that it takes at least 8 minutes to find a parking 
space. 


• A 0.0001
• B 0.9270
• C 0.1862
• D 0.0668


Solution: 


D 


Exercise: 


Problem: 


Seventy percent of the time, it takes more than how many minutes to 
find a parking space? 


• A 1.24
• B 2.41
• C 3.95
• D 6.05


Solution: 


C
Exercise: 
Problem: 


If the mean is significantly greater than the standard deviation, which 
of the following statements is true? 


• I The data cannot follow the uniform distribution.
• II The data cannot follow the exponential distribution.
• III The data cannot follow the normal distribution.


• A I only

• B II only

• C III only

• D I, II, and III


Solution: 


B 


Review 


The next two questions refer to: X ~ U(3, 13) 
Exercise: 


Problem: Explain which of the following are false and which are true. 


• a $f(x) = \frac{1}{10}$, $3 < x < 13$
• b There is no mode.
• c The median is less than the mean.
• d P(x > 10) = P(x < 6)


Solution: 


• a True

• b True

• c False - the median and the mean are the same for this
symmetric distribution

• d True


Exercise: 


Problem: Calculate: 


• a Mean
• b Median
• c 65th percentile


Solution: 


• a 8
• b 8
• c $P(x < k) = 0.65 = (k - 3) \cdot \frac{1}{10}$, so $k = 9.5$


Exercise: 


Problem: Which of the following is true for the above box plot? 


• a 25% of the data are at most 5.

• b There is about the same amount of data from 4 to 5 as there is
from 5 to 7.

• c There are no data values of 3.

• d 50% of the data are 4.


Solution: 


• a False - 3/4 of the data are at most 5
• b True - each quartile has 25% of the data
• c False - that is unknown
• d False - 50% of the data are 4 or less


Exercise: 


Problem: 
If P(G | H) = P(G), then which of the following is correct?


• A G and H are mutually exclusive events.

• B P(G) = P(H)

• C Knowing that H has occurred will affect the chance that G will
happen.

• D G and H are independent events.


Solution: 


D 


Exercise: 


Problem: 


If P(J) = 0.3, P(K) = 0.6, and J and K are independent events, 
then explain which are correct and which are incorrect. 


• A P(J and K) = 0
• B P(J or K) = 0.9
• C P(J or K) = 0.72
• D P(J) ≠ P(J | K)


Solution: 


• A False - J and K are independent, so they are not mutually
exclusive (mutually exclusive events would be dependent). This means
P(J and K) is not 0.

• B False - see answer C.

• C True - P(J or K) = P(J) + P(K) - P(J and K) = P(J) + P(K) -
P(J)P(K) = 0.3 + 0.6 - (0.3)(0.6) = 0.72. Note that P(J and K) =
P(J)P(K) because J and K are independent.

• D False - J and K are independent, so P(J) = P(J | K).


Exercise: 


Problem: 


On average, 5 students from each high school class get full 
scholarships to 4-year colleges. Assume that most high school classes 
have about 500 students. 


X =the number of students from a high school class that get full 
scholarships to 4-year school. Which of the following is the 
distribution of X? 


• A P(5)
• B B(500, 5)
• C Exp(1/5)
• D N(5, (0.01)(0.99)/500)


Solution: 


A 


Normal Distribution: Normal Distribution Lab I (edited: Teegarden) 
Labs changed to incorporate Minitab.


Normal Distribution Lab 


Name: 


I Student Learning Outcome: 


* The student will compare and contrast empirical data and a theoretical 
distribution. 


* Find Probabilities for specific Normal Distributions 


II The Situation 


It is generally accepted that the mean body temperature is 98.6 degrees.
Suppose a sample of size 100 resulted in a sample mean of 98.3 degrees with a
standard deviation of 0.64 degrees. Does this sample suggest that the mean
body temperature is actually lower than 98.6 degrees?


III Simulation: To answer the question, complete the following 
simulation. 


Using Minitab (Calc -> Random Data -> Normal), generate 100 values from
a normally distributed population with a mean of 98.6 degrees and a
standard deviation of 0.64 degrees (using the sample standard deviation
given in the situation since the population deviation is unknown). Repeat
the simulation 9 more times for a total of 10. (Requesting the data be stored
in c2-c10 will generate the remaining 9 columns of data with one command.)
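
For readers without Minitab, the same simulation can be sketched in Python. This is an illustrative alternative and not part of the lab instructions: the seed is arbitrary, the labels c1 to c10 simply mimic the Minitab column names, and your sample means will differ from run to run if the seed is changed.

```python
import random
from statistics import mean

random.seed(1)  # arbitrary seed so the run is repeatable

POP_MEAN = 98.6    # assumed population mean (degrees)
POP_STDEV = 0.64   # sample standard deviation used in place of the unknown sigma
SAMPLE_SIZE = 100
NUM_SAMPLES = 10

# Generate 10 samples of 100 normally distributed values each.
samples = [
    [random.gauss(POP_MEAN, POP_STDEV) for _ in range(SAMPLE_SIZE)]
    for _ in range(NUM_SAMPLES)
]

# One sample mean per simulated data set, like the 10 Minitab columns.
sample_means = [mean(sample) for sample in samples]

for column, value in enumerate(sample_means, start=1):
    print(f"c{column}: sample mean = {value:.2f}")
```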


IV Data Collection 


Use Stats -> Basic Stats -> Display Descriptive and select all 10 columns to
determine the sample mean for each data set. Record the values below and
include the session window with this lab.


[Blank table for recording the sample mean of each of the 10 data sets]


V Analyze the Data: Using complete sentences.


Based on your simulation, do you think that a sample of size 100 with a
mean temperature of 98.3 is reasonable? Answer using 2 to 3 complete
sentences.


VI Finding Probabilities for Normal Distribution 


For each of the following, first write the question in symbolic form and
then, using Minitab (Calc -> Probability Distributions -> Normal), find the
probabilities. (Attach your session window to this lab.)


1. Given a population with a normal distribution, a mean of 0, and a
standard deviation of 1, find the probability of a value less than 1.25.


2. Given a population with a normal distribution, a mean of 25, and a
standard deviation of 3, find the probability of a value greater than
21.25.


3. Given a population with a normal distribution, a mean of 100, and a
standard deviation of 20, find the probability of a value between 87 and
122.


4. Given a population with a normal distribution, a mean of 150, and a
standard deviation of 35, what value has an area of 0.34 to the left?


5. Given a population with a normal distribution, a mean of 150, and a 
standard deviation of 35, what value has an area of 0.34 to the right? 


6. Given a population with a normal distribution, a mean of 15, and a 
standard deviation of 2, what value has an area of 0.8 to the left? 


7. Given a population with a normal distribution, a mean of 200, and a 
standard deviation of 15, which two values form the upper and lower 
boundary of the middle 80%? 


The Central Limit Theorem 
This module provides a brief introduction to the Central Limit Theorem. 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Recognize Central Limit Theorem problems.
• Classify continuous word problems by their distributions.
• Apply and interpret the Central Limit Theorem for Means.
• Apply and interpret the Central Limit Theorem for Sums.


Introduction 


Why are we so concerned with means? Two reasons are that they give us a 
middle ground for comparison and they are easy to calculate. In this 
chapter, you will study means and the Central Limit Theorem. 


The Central Limit Theorem (CLT for short) is one of the most powerful
and useful ideas in all of statistics. The theorem has two alternative forms,
and both are concerned with drawing finite samples of size n from a
population with a known mean, $\mu$, and a known standard deviation,
$\sigma$. The first alternative says that if we
collect samples of size n and n is "large enough," calculate each sample's
mean, and create a histogram of those means, then the resulting histogram
will tend to have an approximate normal bell shape. The second alternative
says that if we again collect samples of size n that are "large enough,"
calculate the sum of each sample and create a histogram, then the resulting
histogram will again tend to have a normal bell shape.


In either case, it does not matter what the distribution of the original 
population is, or whether you even need to know it. The important fact 
is that the sample means and the sums tend to follow the normal 
distribution. And, the rest you will learn in this chapter. 


The size of the sample, n, that is required in order to be "large enough"
depends on the original population from which the samples are drawn. If
the original population is far from normal, then more observations are
needed for the sample means or the sample sums to be normal. Sampling is
done with replacement.


Optional Collaborative Classroom Activity 


Do the following example in class: Suppose 8 of you roll 1 fair die 10 
times, 7 of you roll 2 fair dice 10 times, 9 of you roll 5 fair dice 10 times, 
and 11 of you roll 10 fair dice 10 times. 


Each time a person rolls more than one die, he/she calculates the sample 
mean of the faces showing. For example, one person might roll 5 fair dice 
and get a 2, 2, 3, 4, 6 on one roll. 


The mean is $\frac{2 + 2 + 3 + 4 + 6}{5} = 3.4$. The 3.4 is one mean when 5 fair dice
are rolled. This same person would roll the 5 dice 9 more times and
calculate 9 more means for a total of 10 means.


Your instructor will pass out the dice to several people as described above. 
Roll your dice 10 times. For each roll, record the faces and find the mean. 
Round to the nearest 0.5. 


Your instructor (and possibly you) will produce one graph (it might be a 
histogram) for 1 die, one graph for 2 dice, one graph for 5 dice, and one 
graph for 10 dice. Since the "mean" when you roll one die, is just the face 
on the die, what distribution do these means appear to be representing? 


Draw the graph for the means using 2 dice. Do the sample means show 
any kind of pattern? 


Draw the graph for the means using 5 dice. Do you see any pattern 
emerging? 


Finally, draw the graph for the means using 10 dice. Do you see any 
pattern to the graph? What can you conclude as you increase the number of 
dice? 


As the number of dice rolled increases from 1 to 2 to 5 to 10, the following 
is happening: 


1. The mean of the sample means remains approximately the same. 

2. The spread of the sample means (the standard deviation of the sample 
means) gets smaller. 

3. The graph appears steeper and thinner. 


You have just demonstrated the Central Limit Theorem (CLT). 


The Central Limit Theorem tells you that as you increase the number of 
dice, the sample means tend toward a normal distribution (the 
sampling distribution). 
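
If dice are not available, the classroom activity can be imitated on a computer. The sketch below is an assumption-laden stand-in for the activity: it uses 5,000 simulated rolls per group instead of 10 rolls per student, an arbitrary random seed, and a helper function name (`simulate_sample_means`) invented for this illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)  # arbitrary seed so the run is repeatable


def simulate_sample_means(num_dice, repeats=5000):
    """Roll `num_dice` fair dice `repeats` times; return the mean face of each roll."""
    return [
        sum(random.randint(1, 6) for _ in range(num_dice)) / num_dice
        for _ in range(repeats)
    ]


# For each group size, store (mean of the sample means, spread of the sample means).
results = {}
for num_dice in (1, 2, 5, 10):
    means = simulate_sample_means(num_dice)
    results[num_dice] = (mean(means), stdev(means))
    print(f"{num_dice:>2} dice: mean of means = {results[num_dice][0]:.2f}, "
          f"spread of means = {results[num_dice][1]:.2f}")
```

In a run of this sketch the mean of the sample means should stay near 3.5 for every group while the spread shrinks as more dice are averaged, which matches observations 1 and 2 in the list above.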


Glossary 


Average 
A number that describes the central tendency of the data. There are a 
number of specialized averages, including the arithmetic mean, 
weighted mean, median, mode, and geometric mean. 


Central Limit Theorem
Given a random variable (RV) with known mean $\mu$ and known
standard deviation $\sigma$. We are sampling with size n and we are
interested in two new RVs - the sample mean, $\bar{X}$, and the sample sum,
$\Sigma X$. If the size n of the sample is sufficiently large, then
$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$ and $\Sigma X \sim N\left(n\mu, \sqrt{n}\,\sigma\right)$. If the size n of the sample is
sufficiently large, then the distribution of the sample means and the
distribution of the sample sums will approximate a normal distribution
regardless of the shape of the population. The mean of the sample
means will equal the population mean and the mean of the sample
sums will equal n times the population mean. The standard deviation
of the distribution of the sample means, $\frac{\sigma}{\sqrt{n}}$, is called the standard
error of the mean.


The Central Limit Theorem for Sample Means (Averages) 


Suppose X is a random variable with a distribution that may be known or 
unknown (it can be any distribution). Using a subscript that matches the 
random variable, suppose: 


• a $\mu_X$ = the mean of X
• b $\sigma_X$ = the standard deviation of X


If you draw random samples of size n, then as n increases, the random
variable $\bar{X}$, which consists of sample means, tends to be normally
distributed and


$\bar{X} \sim N\left(\mu_X, \frac{\sigma_X}{\sqrt{n}}\right)$


The Central Limit Theorem for Sample Means says that if you keep
drawing larger and larger samples (like rolling 1, 2, 5, and, finally, 10 dice)
and calculating their means, the sample means form their own normal
distribution (the sampling distribution). The normal distribution has the
same mean as the original distribution and a variance that equals the
original variance divided by n, the sample size. n is the number of values
that are averaged together, not the number of times the experiment is done.


To put it more formally, if you draw random samples of size n, the
distribution of the random variable $\bar{X}$, which consists of sample means, is
called the sampling distribution of the mean. The sampling distribution of
the mean approaches a normal distribution as n, the sample size, increases.


The random variable $\bar{X}$ has a different z-score associated with it than the
random variable X. $\bar{x}$ is the value of $\bar{X}$ in one sample.
Equation:


$z = \frac{\bar{x} - \mu_X}{\sigma_X / \sqrt{n}}$


$\mu_X$ is both the average of X and of $\bar{X}$.


$\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$ = standard deviation of $\bar{X}$ and is called the standard error of
the mean.


Example: 

An unknown distribution has a mean of 90 and a standard deviation of 15. 
Samples of size n = 25 are drawn randomly from the population. 
Exercise: 


Problem: 

Find the probability that the sample mean is between 85 and 92. 
Solution: 

Let X = one value from the original unknown population. The
probability question asks you to find a probability for the sample
mean.


Let $\bar{X}$ = the mean of a sample of size 25. Since $\mu_X = 90$, $\sigma_X = 15$,
and n = 25;


then $\bar{X} \sim N\left(90, \frac{15}{\sqrt{25}}\right)$


Find $P(85 < \bar{x} < 92)$. Draw a graph.


$P(85 < \bar{x} < 92) = 0.6997$


The probability that the sample mean is between 85 and 92 is 0.6997.


[Graph: normal curve with 85, 90, and 92 marked on the horizontal axis; the area for $P(85 < \bar{x} < 92)$ is shaded]


TI-83 or 84: normalcdf(lower value, upper value, mean, standard
error of the mean)


The parameter list is abbreviated (lower value, upper value, $\mu$, $\frac{\sigma}{\sqrt{n}}$)


$\text{normalcdf}\left(85, 92, 90, \frac{15}{\sqrt{25}}\right) = 0.6997$
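
The same calculation can be reproduced without a TI calculator. The following is a minimal sketch that uses Python's standard `statistics.NormalDist` as a stand-in for `normalcdf`; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu_x = 90      # population mean
sigma_x = 15   # population standard deviation
n = 25         # sample size

# Sampling distribution of the sample mean: N(mu_x, sigma_x / sqrt(n)).
standard_error = sigma_x / sqrt(n)
x_bar_dist = NormalDist(mu_x, standard_error)

# P(85 < x-bar < 92) is the area under the curve between 85 and 92.
probability = x_bar_dist.cdf(92) - x_bar_dist.cdf(85)
print(f"standard error = {standard_error}, P = {probability:.4f}")
```

Note that the fourth argument of `normalcdf` corresponds to the standard error (here 3), not to the population standard deviation of 15.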
Exercise: 
Problem: 


Find the value that is 2 standard deviations above the expected value 
(it is 90) of the sample mean. 


Solution: 


To find the value that is 2 standard deviations above the expected
value 90, use the formula


value = $\mu_X$ + (# of STDEVs)$\left(\frac{\sigma_X}{\sqrt{n}}\right)$


value = $90 + 2 \cdot \frac{15}{\sqrt{25}} = 96$


So, the value that is 2 standard deviations above the expected value is
96.


Example: 

The length of time, in hours, it takes an "over 40" group of people to play 
one soccer match is normally distributed with a mean of 2 hours and a 
standard deviation of 0.5 hours. A sample of size n = 50 is drawn 
randomly from the population. 

Exercise: 


Problem: 


Find the probability that the sample mean is between 1.8 hours and 
2.3 hours. 


Solution: 
Let X = the time, in hours, it takes to play one soccer match.


The probability question asks you to find a probability for the sample
mean time, in hours, it takes to play one soccer match.


Let $\bar{X}$ = the mean time, in hours, it takes to play one soccer match.


If $\mu_X$ = _____, $\sigma_X$ = _____, and n = _____, then
$\bar{X} \sim N$(_____, _____) by the Central Limit Theorem for Means.


$\mu_X = 2$, $\sigma_X = 0.5$, $n = 50$, and $\bar{X} \sim N\left(2, \frac{0.5}{\sqrt{50}}\right)$

Find $P(1.8 < \bar{x} < 2.3)$. Draw a graph.

$P(1.8 < \bar{x} < 2.3) = 0.9977$


$\text{normalcdf}\left(1.8, 2.3, 2, \frac{0.5}{\sqrt{50}}\right) = 0.9977$


The probability that the mean time is between 1.8 hours and 2.3 hours
is 0.9977.

Glossary 


Normal Distribution
A continuous random variable (RV) with pdf
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$, where $\mu$ is the mean of the distribution and
$\sigma$ is the standard deviation. Notation: $X \sim N(\mu, \sigma)$. If $\mu = 0$ and
$\sigma = 1$, the RV is called the standard normal distribution.


Standard Error of the Mean
The standard deviation of the distribution of the sample means, $\frac{\sigma}{\sqrt{n}}$.


The Central Limit Theorem for Sums 


Suppose X is a random variable with a distribution that may be known or 
unknown (it can be any distribution) and suppose: 


• a $\mu_X$ = the mean of X
• b $\sigma_X$ = the standard deviation of X


If you draw random samples of size n, then as n increases, the random
variable $\Sigma X$, which consists of sums, tends to be normally distributed and


$\Sigma X \sim N\left(n \cdot \mu_X, \sqrt{n} \cdot \sigma_X\right)$


The Central Limit Theorem for Sums says that if you keep drawing
larger and larger samples and taking their sums, the sums form their own
distribution (the sampling distribution), which approaches a normal
distribution as the sample size increases. The normal distribution has a
mean equal to the original mean multiplied by the sample size and a
standard deviation equal to the original standard deviation multiplied
by the square root of the sample size.


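The random variable $\Sigma X$ has the following z-score associated with it:


• a $\Sigma x$ is one sum.
• b $z = \frac{\Sigma x - n \cdot \mu_X}{\sqrt{n} \cdot \sigma_X}$


• a $n \cdot \mu_X$ = the mean of $\Sigma X$
• b $\sqrt{n} \cdot \sigma_X$ = standard deviation of $\Sigma X$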


Example: 

An unknown distribution has a mean of 90 and a standard deviation of 15. 
A sample of size 80 is drawn randomly from the population. 

Exercise: 


Problem: 


• a Find the probability that the sum of the 80 values (or the total of
the 80 values) is more than 7500.
• b Find the sum that is 1.5 standard deviations above the mean of
the sums.
Solution: 
Let X = one value from the original unknown population. The
probability question asks you to find a probability for the sum (or
total of) 80 values.


$\Sigma X$ = the sum or total of 80 values. Since $\mu_X = 90$, $\sigma_X = 15$, and
n = 80, then


$\Sigma X \sim N\left(80 \cdot 90, \sqrt{80} \cdot 15\right)$


• mean of the sums = $n \cdot \mu_X$ = (80)(90) = 7200
• standard deviation of the sums = $\sqrt{n} \cdot \sigma_X = \sqrt{80} \cdot 15$
• sum of 80 values = $\Sigma x$ = 7500


• a Find $P(\Sigma x > 7500)$


$P(\Sigma x > 7500) = 0.0127$


[Graph: normal curve with 7200 and 7500 marked on the horizontal axis; the area for $P(\Sigma x > 7500)$ is shaded]


normalcdf(lower value, upper value, mean of sums, stdev of
sums)


The parameter list is abbreviated (lower, upper, $n \cdot \mu_X$, $\sqrt{n} \cdot \sigma_X$)


$\text{normalcdf}\left(7500, \text{1E99}, 80 \cdot 90, \sqrt{80} \cdot 15\right) = 0.0127$
Reminder: 1E99 = $10^{99}$. Press the EE key for E.


• b Find $\Sigma x$ where z = 1.5:
$\Sigma x = n \cdot \mu_X + z \cdot \sqrt{n} \cdot \sigma_X = (80)(90) + (1.5)(\sqrt{80})(15) = 7401.2$
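
Both parts of this example can be checked with a few lines of Python. This is a sketch only: it uses the standard `statistics.NormalDist` in place of `normalcdf`, and the variable names are chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu_x = 90      # population mean
sigma_x = 15   # population standard deviation
n = 80         # sample size

# Sampling distribution of the sum: N(n * mu_x, sqrt(n) * sigma_x).
sum_dist = NormalDist(n * mu_x, sqrt(n) * sigma_x)

# Part a: P(sum > 7500) is the area to the right of 7500.
p_sum_over_7500 = 1 - sum_dist.cdf(7500)

# Part b: the sum that lies 1.5 standard deviations above the mean of the sums.
sum_at_z = sum_dist.mean + 1.5 * sum_dist.stdev

print(f"P(sum > 7500) = {p_sum_over_7500:.4f}")
print(f"sum at z = 1.5: {sum_at_z:.1f}")
```

Using `1 - cdf(7500)` plays the same role as giving `normalcdf` an upper bound of 1E99.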


Using the Central Limit Theorem 

Central Limit Theorem: Using the Central Limit Theorem is part of the 
collection col10555 written by Barbara Illowsky and Susan Dean. It covers 
how and when to use the Central Limit Theorem and has contributions from 
Roberta Bloom. 


It is important for you to understand when to use the CLT. If you are being 
asked to find the probability of the mean, use the CLT for the mean. If you 
are being asked to find the probability of a sum or total, use the CLT for 
sums. This also applies to percentiles for means and sums. 


Note: If you are being asked to find the probability of an individual value,
do not use the CLT. Use the distribution of its random variable.


Examples of the Central Limit Theorem 
Law of Large Numbers 


The Law of Large Numbers says that if you take samples of larger and
larger size from any population, then the mean $\bar{x}$ of the sample tends to get
closer and closer to $\mu$. From the Central Limit Theorem, we know that as n
gets larger and larger, the sample means follow a normal distribution. The
larger n gets, the smaller the standard deviation gets. (Remember that the
standard deviation for $\bar{X}$ is $\frac{\sigma}{\sqrt{n}}$.) This means that the sample mean $\bar{x}$ must
be close to the population mean $\mu$. We can say that $\mu$ is the value that the
sample means approach as n gets larger. The Central Limit Theorem
illustrates the Law of Large Numbers.
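
The Law of Large Numbers can be illustrated with a small simulation. The sketch below makes several assumptions that are not part of the text: the population is a single fair die (so $\mu = 3.5$ and $\sigma = \sqrt{35/12}$), the sample sizes and the random seed are arbitrary, and the exact sample means will change if the seed changes.

```python
import random
from math import sqrt
from statistics import mean

random.seed(2)  # arbitrary seed so the run is repeatable

MU = 3.5                # population mean of one fair die
SIGMA = sqrt(35 / 12)   # population standard deviation of one fair die

# For each sample size, record (n, sample mean, standard error sigma / sqrt(n)).
rows = []
for n in (10, 100, 1000, 10000):
    sample = [random.randint(1, 6) for _ in range(n)]
    rows.append((n, mean(sample), SIGMA / sqrt(n)))

for n, x_bar, standard_error in rows:
    print(f"n = {n:>5}: sample mean = {x_bar:.3f}, "
          f"standard error = {standard_error:.3f}")
```

As n grows, the standard error shrinks, so the sample mean is increasingly likely to land close to 3.5.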


Central Limit Theorem for the Mean and Sum Examples 


Example: 


A study involving stress is done on a college campus among the students. 
The stress scores follow a uniform distribution with the lowest stress 
score equal to 1 and the highest equal to 5. Using a sample of 75 students, 
find: 


1. The probability that the mean stress score for the 75 students is less 
than 2. 

2. The 90th percentile for the mean stress score for the 75 students. 

3. The probability that the total of the 75 stress scores is less than 200. 

4. The 90th percentile for the total stress score for the 75 students. 


Let X = one stress score. 

Problems 1 and 2 ask you to find a probability or a percentile for a mean.
Problems 3 and 4 ask you to find a probability or a percentile for a total or
sum. The sample size, n, is equal to 75.

Since the individual stress scores follow a uniform distribution, X ~
U(1, 5) where a = 1 and b = 5 (see Continuous Random Variables for the
uniform distribution).


$\mu_X = \frac{a + b}{2} = \frac{1 + 5}{2} = 3$


$\sigma_X = \sqrt{\frac{(b - a)^2}{12}} = \sqrt{\frac{(5 - 1)^2}{12}} = 1.15$


For problems 1 and 2, let $\bar{X}$ = the mean stress score for the 75 students.
Then,


$\bar{X} \sim N\left(3, \frac{1.15}{\sqrt{75}}\right)$ where n = 75.
Exercise: 
Problem: Find $P(\bar{x} < 2)$. Draw the graph.
Solution:
$P(\bar{x} < 2) = 0$


The probability that the mean stress score is less than 2 is about 0.


[Graph: $P(\bar{x} < 2)$]


$\text{normalcdf}\left(1, 2, 3, \frac{1.15}{\sqrt{75}}\right) = 0$


Note: The smallest stress score is 1. Therefore, the smallest mean for
75 stress scores is 1.


Exercise: 


Problem: 


Find the 90th percentile for the mean of 75 stress scores. Draw a 
graph. 


Solution: 
Let k = the 90th percentile.
Find k where $P(\bar{x} < k) = 0.90$.


k = 3.2


[Graph: $P(\bar{x} < k) = 0.90$]


The 90th percentile for the mean of 75 scores is about 3.2. This tells
us that 90% of all the means of 75 stress scores are at most 3.2 and
10% are at least 3.2.


$\text{invNorm}\left(0.90, 3, \frac{1.15}{\sqrt{75}}\right) = 3.2$


For problems 3 and 4, let $\Sigma X$ = the sum of the 75 stress scores. Then,
$\Sigma X \sim N\left((75)(3), \sqrt{75} \cdot 1.15\right)$


Exercise:


Problem: Find $P(\Sigma x < 200)$. Draw the graph.


Solution: 
The mean of the sum of 75 stress scores is $75 \cdot 3 = 225$


The standard deviation of the sum of 75 stress scores is
$\sqrt{75} \cdot 1.15 = 9.96$


$P(\Sigma x < 200) = 0$


[Graph: $P(\Sigma x < 200)$, with 200 and 225 marked on the horizontal axis]


The probability that the total of 75 scores is less than 200 is about 0.


$\text{normalcdf}\left(75, 200, 75 \cdot 3, \sqrt{75} \cdot 1.15\right) = 0$


Note: The smallest total of 75 stress scores is 75 since the smallest
single score is 1.


Exercise: 


Problem: 

Find the 90th percentile for the total of 75 stress scores. Draw a graph. 
Solution: 

Let k = the 90th percentile.

Find k where $P(\Sigma x < k) = 0.90$.


k = 237.8


[Graph: $P(\Sigma x < k) = 0.90$, with 225 and k marked on the horizontal axis]


The 90th percentile for the sum of 75 scores is about 237.8. This tells
us that 90% of all the sums of 75 scores are no more than 237.8 and
10% are no less than 237.8.


$\text{invNorm}\left(0.90, 75 \cdot 3, \sqrt{75} \cdot 1.15\right) = 237.8$
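
The four stress-score results can be reproduced together in one short script. This is a sketch under stated assumptions: it uses Python's `statistics.NormalDist` in place of `normalcdf` and `invNorm`, and it keeps the unrounded standard deviation (about 1.1547) rather than the rounded 1.15 used above, so the last digits can differ slightly from the calculator values.

```python
from math import sqrt
from statistics import NormalDist

a, b = 1, 5   # lowest and highest stress scores, X ~ U(1, 5)
n = 75        # number of students in the sample

mu_x = (a + b) / 2                  # mean of the uniform distribution
sigma_x = sqrt((b - a) ** 2 / 12)   # standard deviation of the uniform distribution

# CLT for means and CLT for sums.
mean_dist = NormalDist(mu_x, sigma_x / sqrt(n))
sum_dist = NormalDist(n * mu_x, sqrt(n) * sigma_x)

p_mean_below_2 = mean_dist.cdf(2)            # problem 1
mean_90th = mean_dist.inv_cdf(0.90)          # problem 2
p_sum_below_200 = sum_dist.cdf(200)          # problem 3
sum_90th = sum_dist.inv_cdf(0.90)            # problem 4

print(f"P(mean < 2) = {p_mean_below_2:.4f}")
print(f"90th percentile of the mean = {mean_90th:.1f}")
print(f"P(sum < 200) = {p_sum_below_200:.4f}")
print(f"90th percentile of the sum = {sum_90th:.1f}")
```

The script starts the area at negative infinity instead of at the smallest possible mean or sum, which makes no practical difference here because the omitted area is essentially 0.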


Example: 

Suppose that a market research analyst for a cell phone company conducts 
a study of their customers who exceed the time allowance included on their 
basic cell phone contract; the analyst finds that for those people who 
exceed the time included in their basic contract, the excess time used 
follows an exponential distribution with a mean of 22 minutes. 

Consider a random sample of 80 customers who exceed the time allowance 
included in their basic cell phone contract. 

Let X = the excess time used by one INDIVIDUAL cell phone customer 
who exceeds his contracted time allowance. 

\(X \sim Exp\left(\frac{1}{22}\right)\). From Chapter 5, we know that μ = 22 and σ = 22.


Let \(\bar{X}\) = the mean excess time used by a sample of n = 80 customers who
exceed their contracted time allowance.


\(\bar{X} \sim N\left(22, \frac{22}{\sqrt{80}}\right)\) by the CLT for Sample Means
Exercise: 

Problem: 

Using the CLT to find Probability: 


• a Find the probability that the mean excess time used by the 80
customers in the sample is longer than 20 minutes. This is asking
us to find \(P(\bar{x} > 20)\). Draw the graph.

• b Suppose that one customer who exceeds the time limit for his
cell phone contract is randomly selected. Find the probability
that this individual customer's excess time is longer than 20
minutes. This is asking us to find \(P(x > 20)\).

• c Explain why the probabilities in (a) and (b) are different.


Solution: 
Part a. 


Find: \(P(\bar{x} > 20)\)


\(P(\bar{x} > 20) = 0.7919\) using normalcdf(20, 1E99, 22, 22/√80)


The probability is 0.7919 that the mean excess time used is more than
20 minutes, for a sample of 80 customers who exceed their contracted
time allowance.


[Figure: graph of \(P(\bar{x} > 20)\), with 20 and 22 marked on the horizontal axis.]


Note: 1E99 = \(10^{99}\) and -1E99 = \(-10^{99}\). Press the EE key for E. Or just use 10^99 instead of 1E99.


Part b. 
Find \(P(x > 20)\). Remember to use the exponential distribution for an
individual: X~Exp(1/22).


\(P(X > 20) = e^{-(1/22) \cdot 20}\) or \(e^{-0.04545 \cdot 20} = 0.4029\)
Part c. Explain why the probabilities in (a) and (b) are different. 


• \(P(x > 20) = 0.4029\) but \(P(\bar{x} > 20) = 0.7919\)

• The probabilities are not equal because we use different
distributions to calculate the probability for individuals and for
means.

• When asked to find the probability of an individual value, use the
stated distribution of its random variable; do not use the CLT.
Use the CLT with the normal distribution when you are being
asked to find the probability for a mean.
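The contrast between parts (a) and (b) can be seen side by side in a short computation. This is a sketch using only Python's standard library, with illustrative variable names: the individual probability uses the exponential tail formula, while the sample-mean probability uses the normal distribution given by the CLT.

```python
from math import exp, sqrt
from statistics import NormalDist

mean_excess = 22   # minutes; for an exponential, sigma equals the mean
n = 80             # customers in the sample

# Part (b), one individual: X ~ Exp(1/22), so P(X > 20) = e^(-20/22).
p_individual = exp(-20 / mean_excess)

# Part (a), the sample mean: X-bar ~ N(22, 22/sqrt(80)) by the CLT.
mean_dist = NormalDist(mean_excess, mean_excess / sqrt(n))
p_sample_mean = 1 - mean_dist.cdf(20)

print(f"P(x > 20) for one customer     = {p_individual:.4f}")
print(f"P(x-bar > 20) for 80 customers = {p_sample_mean:.4f}")
```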


Exercise: 


Problem: 


Using the CLT to find Percentiles: 

Find the 95th percentile for the sample mean excess time for samples 
of 80 customers who exceed their basic contract time allowances. 
Draw a graph. 


Solution: 


Let k = the 95th percentile. Find k where \(P(\bar{x} < k) = 0.95\)


k = 26.0 using invNorm(.95, 22, 22/√80) = 26.0


[Figure: graph of \(P(\bar{x} < k) = 0.95\), with 22 and k marked on the horizontal axis.]


The 95th percentile for the sample mean excess time used is about 
26.0 minutes for random samples of 80 customers who exceed their 
contractual allowed time. 


95% of such samples would have means under 26 minutes; only 5% 
of such samples would have means above 26 minutes. 
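The same percentile can be recovered by hand from the z-score for the 95th percentile. The value 1.645 is the usual standard normal table value; it is not quoted in this example and is an assumption of this sketch.

```latex
k = \mu_X + z_{0.95} \cdot \frac{\sigma_X}{\sqrt{n}}
  \approx 22 + 1.645 \cdot \frac{22}{\sqrt{80}}
  \approx 22 + 4.046
  \approx 26.0
```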


Note (HISTORICAL): Normal Approximation to the Binomial


Historically, being able to compute binomial probabilities was one of the 
most important applications of the Central Limit Theorem. Binomial 
probabilities were displayed in a table in a book with a small value for n 
(say, 20). To calculate the probabilities with large values of n, you had to 
use the binomial formula which could be very complicated. Using the 
Normal Approximation to the Binomial simplified the process. To 
compute the Normal Approximation to the Binomial, take a simple random 
sample from a population. You must meet the conditions for a binomial 
distribution: 


• there are a certain number n of independent trials
• the outcomes of any trial are success or failure
• each trial has the same probability of a success p


Recall that if X is the binomial random variable, then X~B(n, p). The
shape of the binomial distribution needs to be similar to the shape of the
normal distribution. To ensure this, the quantities np and nq must both be
greater than five (np > 5 and nq > 5; the approximation is better if they
are both greater than or equal to 10). Then the binomial can be
approximated by the normal distribution with mean μ = np and standard
deviation \(\sigma = \sqrt{npq}\). Remember that q = 1 - p. In order to get the best
approximation, add 0.5 to x or subtract 0.5 from x (use x + 0.5 or x - 0.5).
The number 0.5 is called the continuity correction factor.


Example: 

Suppose in a local Kindergarten through 12th grade (K - 12) school 
district, 53 percent of the population favor a charter school for grades K - 
5. A simple random sample of 300 is surveyed. 


1. Find the probability that at least 150 favor a charter school. 

2. Find the probability that at most 160 favor a charter school. 

3. Find the probability that more than 155 favor a charter school. 
4. Find the probability that less than 147 favor a charter school. 
5. Find the probability that exactly 175 favor a charter school. 


Let X = the number that favor a charter school for grades K - 5. X~
B(n, p) where n = 300 and p = 0.53. Since np > 5 and nq > 5, use the
normal approximation to the binomial. The formulas for the mean and
standard deviation are μ = np and \(\sigma = \sqrt{npq}\). The mean is 159 and the
standard deviation is 8.6447. The random variable for the normal
distribution is Y. Y~N(159, 8.6447). See The Normal Distribution for
help with calculator instructions.

For Problem 1., you include 150 so P(X ≥ 150) has normal
approximation P(Y ≥ 149.5) = 0.8641.

normalcdf(149.5, 10^99, 159, 8.6447) = 0.8641

For Problem 2., you include 160 so P(X ≤ 160) has normal
approximation P(Y ≤ 160.5) = 0.5689.

normalcdf(0, 160.5, 159, 8.6447) = 0.5689

For Problem 3., you exclude 155 so P(X > 155) has normal
approximation P(Y > 155.5) = 0.6572.

normalcdf(155.5, 10^99, 159, 8.6447) = 0.6572

For Problem 4., you exclude 147 so P(X < 147) has normal
approximation P(Y < 146.5) = 0.0741.

normalcdf(0, 146.5, 159, 8.6447) = 0.0741

For Problem 5., P(X = 175) has normal approximation
P(174.5 < Y < 175.5) = 0.0083.

normalcdf(174.5, 175.5, 159, 8.6447) = 0.0083
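The five continuity-corrected approximations above can be reproduced in a few lines. This is a sketch assuming Python's standard-library `statistics.NormalDist`; the dictionary and its labels are illustrative, and open tails are used where the calculator commands use 0 or 10^99 as bounds.

```python
from math import sqrt
from statistics import NormalDist

n, p = 300, 0.53
q = 1 - p

# Normal approximation: Y ~ N(np, sqrt(npq)) = N(159, 8.6447).
Y = NormalDist(n * p, sqrt(n * p * q))

# Each binomial event is widened or narrowed by 0.5 (continuity correction).
approx = {
    "P(X >= 150)": 1 - Y.cdf(149.5),
    "P(X <= 160)": Y.cdf(160.5),
    "P(X > 155)": 1 - Y.cdf(155.5),
    "P(X < 147)": Y.cdf(146.5),
    "P(X = 175)": Y.cdf(175.5) - Y.cdf(174.5),
}

for label, value in approx.items():
    print(f"{label} is approximately {value:.4f}")
```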
Because of calculators and computer software that easily let you 
calculate binomial probabilities for large values of n, it is not necessary to 
use the Normal Approximation to the Binomial provided you have
access to these technology tools. Most school labs have Microsoft Excel, 
an example of computer software that calculates binomial probabilities. 
Many students have access to the TI-83 or 84 series calculators and they 
easily calculate probabilities for the binomial. In an Internet browser, if 
you type in "binomial probability distribution calculation," you can find at 
least one online calculator for the binomial. 
For Example 3, the probabilities are calculated using the binomial (n=300 
and p=0.53) below. Compare the binomial and normal distribution 
answers. See Discrete Random Variables for help with calculator 
instructions for the binomial. 
P(X ≥ 150): 1 - binomialcdf(300, 0.53, 149) = 0.8641

P(X ≤ 160): binomialcdf(300, 0.53, 160) = 0.5684

P(X > 155): 1 - binomialcdf(300, 0.53, 155) = 0.6576

P(X < 147): binomialcdf(300, 0.53, 146) = 0.0742

P(X = 175): (You use the binomial pdf.) binomialpdf(300, 0.53, 175) = 0.0083
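For readers without a TI calculator or spreadsheet, the exact binomial values can also be computed directly from the binomial formula. The helper names `binom_pmf` and `binom_cdf` below are hypothetical, written for this sketch with Python's standard-library `math.comb`; they are not calculator or library functions.

```python
from math import comb

def binom_pmf(k, n, p):
    """Exact binomial probability P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """Exact cumulative probability P(X <= k) for X ~ B(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

n, p = 300, 0.53
exact = {
    "P(X >= 150)": 1 - binom_cdf(149, n, p),
    "P(X <= 160)": binom_cdf(160, n, p),
    "P(X > 155)": 1 - binom_cdf(155, n, p),
    "P(X < 147)": binom_cdf(146, n, p),
    "P(X = 175)": binom_pmf(175, n, p),
}

for label, value in exact.items():
    print(f"{label} = {value:.4f}")
```

Comparing this output with the normal approximation shows how small the approximation error is when n = 300.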


**Contributions made to Example 2 by Roberta Bloom 


Glossary 


Average 
A number that describes the central tendency of the data. There are a 
number of specialized averages, including the arithmetic mean, 
weighted mean, median, mode, and geometric mean. 


Central Limit Theorem 
Given a random variable (RV) with known mean μ and known
standard deviation σ, we sample with size n and are
interested in two new RVs: the sample mean, \(\bar{X}\), and the sample sum,
\(\Sigma X\). If the size n of the sample is sufficiently large, then
\(\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\) and \(\Sigma X \sim N\left(n\mu, \sqrt{n}\,\sigma\right)\). If the size n of the sample is
sufficiently large, then the distribution of the sample means and the
distribution of the sample sums will approximate a normal distribution
regardless of the shape of the population. The mean of the sample
means will equal the population mean and the mean of the sample
sums will equal n times the population mean. The standard deviation
of the distribution of the sample means, \(\frac{\sigma}{\sqrt{n}}\), is called the standard
error of the mean.


Exponential Distribution 
A continuous random variable (RV) that appears when we are 
interested in the intervals of time between some random events, for 
example, the length of time between emergency arrivals at a hospital. 
Notation: X~Exp(m). The mean is \(\mu = \frac{1}{m}\) and the standard deviation
is \(\sigma = \frac{1}{m}\). The probability density function is \(f(x) = me^{-mx}\), x ≥ 0,
and the cumulative distribution function is \(P(X \le x) = 1 - e^{-mx}\).


Mean 
A number that measures the central tendency. A common name for
mean is 'average.' The term 'mean' is a shortened form of 'arithmetic
mean.' By definition, the mean for a sample (denoted by \(\bar{x}\)) is
\(\bar{x} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}}\), and the mean for a population
(denoted by μ) is \(\mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}}\).


Uniform Distribution 


A continuous random variable (RV) that has equally likely outcomes
over the domain, a < x < b. Often referred to as the Rectangular
distribution because the graph of the pdf has the form of a rectangle.
Notation: X~U(a,b). The mean is \(\mu = \frac{a+b}{2}\) and the standard deviation
is \(\sigma = \sqrt{\frac{(b-a)^2}{12}}\). The probability density function is \(f(x) = \frac{1}{b-a}\) for
a < x < b or a ≤ x ≤ b. The cumulative distribution is
\(P(X \le x) = \frac{x-a}{b-a}\).


Summary of Formulas 
Formula 
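Central Limit Theorem for Sample Means

\(\bar{X} \sim N\left(\mu_X, \frac{\sigma_X}{\sqrt{n}}\right)\)   The Mean (\(\bar{X}\)): \(\mu_X\)

Formula
Central Limit Theorem for Sample Means Z-Score and Standard Error of
the Mean

\(z = \frac{\bar{x} - \mu_X}{\sigma_X / \sqrt{n}}\)   Standard Error of the Mean (Standard Deviation (\(\bar{X}\))): \(\frac{\sigma_X}{\sqrt{n}}\)

Formula
Central Limit Theorem for Sums

\(\Sigma X \sim N\left[(n) \cdot \mu_X, \sqrt{n} \cdot \sigma_X\right]\)   Mean for Sums (\(\Sigma X\)): \(n \cdot \mu_X\)

Formula
Central Limit Theorem for Sums Z-Score and Standard Deviation for Sums

\(z = \frac{\Sigma x - n \cdot \mu_X}{\sqrt{n} \cdot \sigma_X}\)   Standard Deviation for Sums (\(\Sigma X\)): \(\sqrt{n} \cdot \sigma_X\)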


Practice: The Central Limit Theorem 


Student Learning Outcomes 


• The student will calculate probabilities using the Central Limit
Theorem.


Given 


Yoonie is a personnel manager in a large corporation. Each month she must 
review 16 of the employees. From past experience, she has found that the 
reviews take her approximately 4 hours each to do with a population 
standard deviation of 1.2 hours. Let X be the random variable representing
the time it takes her to complete one review. Assume X is normally
distributed. Let \(\bar{X}\) be the random variable representing the mean time to
complete the 16 reviews. Let \(\Sigma X\) be the total time it takes Yoonie to
complete all of the month’s reviews. Assume that the 16 reviews represent a
random set of reviews.


Distribution 
Complete the distributions. 


1. X ~ ____
2. \(\bar{X}\) ~ ____
3. \(\Sigma X\) ~ ____


Graphing Probability 
For each problem below: 
• a Sketch the graph. Label and scale the horizontal axis. Shade the
region corresponding to the probability.
• b Calculate the value.


Exercise: 


Problem: 


Find the probability that one review will take Yoonie from 3.5 to 4.25 
hours. 


• a
• b P(____ < x < ____) = ____


Solution:
• b 3.5, 4.25, 0.2441
Exercise: 
Problem: 


Find the probability that the mean of a month’s reviews will take 
Yoonie from 3.5 to 4.25 hrs. 


• a
• b P(____) = ____


Solution:
• b 0.7499
Exercise: 
Problem: 


Find the 95th percentile for the mean time to complete one month’s 
reviews. 


• a
• b The 95th Percentile = ____


Solution:
• b 4.49 hours
Exercise: 
Problem: 


Find the probability that the sum of the month’s reviews takes Yoonie 
from 60 to 65 hours. 


• a
• b The Probability = ____


Solution:
• b 0.3802


Exercise: 


Problem: Find the 95th percentile for the sum of the month’s reviews. 


• a
• b The 95th percentile = ____


Solution:
• b 71.90
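The printed solutions to this practice can be checked numerically. The sketch below assumes Python's standard-library `statistics.NormalDist`; the helper `between` and the variable names are hypothetical. Because X is assumed normal in this problem, the distributions for the mean and the sum are exactly normal, not just approximately so.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 4, 1.2, 16   # hours per review, and reviews per month

X = NormalDist(mu, sigma)                      # one review
X_bar = NormalDist(mu, sigma / sqrt(n))        # mean of 16 reviews
X_sum = NormalDist(n * mu, sqrt(n) * sigma)    # total of 16 reviews

def between(dist, low, high):
    """P(low < value < high) under the given normal distribution."""
    return dist.cdf(high) - dist.cdf(low)

p_one_review = between(X, 3.5, 4.25)
p_mean = between(X_bar, 3.5, 4.25)
pct95_mean = X_bar.inv_cdf(0.95)
p_sum = between(X_sum, 60, 65)
pct95_sum = X_sum.inv_cdf(0.95)

print(f"{p_one_review:.4f} {p_mean:.4f} {pct95_mean:.2f} {p_sum:.4f} {pct95_sum:.2f}")
```

The gap between the first two probabilities is the subject of the discussion question that follows: the same interval captures far more of the distribution of the mean because its standard deviation is 1.2/√16 = 0.3 rather than 1.2.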


Discussion Question 


Exercise: 


Problem: What causes the probabilities in [link] and [link] to differ? 


Homework 

The Central Limit Theorem: Homework is part of the collection col10555 
written by Barbara Illowsky and Susan Dean. 

Exercise: 


Problem: 


X ~ N(60,9). Suppose that you form random samples of 25 from this
distribution. Let \(\bar{X}\) be the random variable of averages. Let \(\Sigma X\) be the
random variable of sums. For c - f, sketch the graph, shade the region,
label and scale the horizontal axis for \(\bar{X}\), and find the probability.


• a Sketch the distributions of X and \(\bar{X}\) on the same graph.
• b \(\bar{X}\) ~
• c \(P(\bar{x} < 60)\) =
• d Find the 30th percentile for the mean.
• e \(P(56 < \bar{x} < 62)\) =
• f \(P(18 < \bar{x} < 58)\) =
• g \(\Sigma X\) ~
• h Find the minimum value for the upper quartile for the sum.
• i \(P(1400 < \Sigma x < 1550)\) =


Solution: 


• b \(\bar{X} \sim N\left(60, \frac{9}{\sqrt{25}}\right)\)
• c 0.5000
• d 59.06
• e 0.8536
• f 0.1333
• h 1530.35
• i 0.8536


Exercise: 


Problem: 


Determine which of the following are true and which are false. Then, 
in complete sentences, justify your answers. 


• a When the sample size is large, the mean of \(\bar{X}\) is approximately
equal to the mean of X.

• b When the sample size is large, \(\bar{X}\) is approximately normally
distributed.

• c When the sample size is large, the standard deviation of \(\bar{X}\) is
approximately the same as the standard deviation of X.


Exercise: 


Problem: 


The percent of fat calories that a person in America consumes each 
day is normally distributed with a mean of about 36 and a standard 
deviation of about 10. Suppose that 16 individuals are randomly 
chosen. 


Let \(\bar{X}\) = average percent of fat calories.


• a \(\bar{X}\) ~ ____ ( ____ , ____ )

• b For the group of 16, find the probability that the average
percent of fat calories consumed is more than 5. Graph the
situation and shade in the area to be determined.

• c Find the first quartile for the average percent of fat calories.


Solution: 
• a \(N\left(36, \frac{10}{\sqrt{16}}\right)\)
• b 1
• c 34.31


Exercise: 


Problem: 


Previously, De Anza statistics students estimated that the amount of 
change daytime statistics students carry is exponentially distributed 
with a mean of $0.88. Suppose that we randomly pick 25 daytime 
statistics students.


• a In words, X =

• b X ~

• c In words, \(\bar{X}\) =

• d \(\bar{X}\) ~ ____ ( ____ , ____ )

• e Find the probability that an individual had between $0.80 and
$1.00. Graph the situation and shade in the area to be determined.

• f Find the probability that the average of the 25 students was
between $0.80 and $1.00. Graph the situation and shade in the
area to be determined.

• g Explain why there is a difference in (e) and (f).


Exercise: 


Problem: 


Suppose that the distance of fly balls hit to the outfield (in baseball) is 
normally distributed with a mean of 250 feet and a standard deviation 
of 50 feet. We randomly sample 49 fly balls. 


• a If \(\bar{X}\) = average distance in feet for 49 fly balls, then \(\bar{X}\) ~

• b What is the probability that the 49 balls traveled an average of
less than 240 feet? Sketch the graph. Scale the horizontal axis for
\(\bar{X}\). Shade the region corresponding to the probability. Find the
probability.

• c Find the 80th percentile of the distribution of the average of 49
fly balls.


Solution: 


• a \(N\left(250, \frac{50}{\sqrt{49}}\right)\)
• b 0.0808
• c 256.01 feet


Exercise: 


Problem: 


Suppose that the weight of open boxes of cereal in a home with 
children is uniformly distributed from 2 to 6 pounds. We randomly 
survey 64 homes with children. 


• a In words, X =

• b X ~

• c \(\mu_X\) =

• d \(\sigma_X\) =

• e In words, \(\Sigma X\) =

• f \(\Sigma X\) ~

• g Find the probability that the total weight of open boxes is less
than 250 pounds.

• h Find the 35th percentile for the total weight of open boxes of
cereal.


Exercise: 


Problem: 


Suppose that the duration of a particular type of criminal trial is known 
to have a mean of 21 days and a standard deviation of 7 days. We 
randomly sample 9 trials. 


• a In words, \(\Sigma X\) =

• b \(\Sigma X\) ~

• c Find the probability that the total length of the 9 trials is at least
225 days.

• d 90 percent of the total of 9 of these types of trials will last at
least how long?


Solution: 


• a The total length of time for 9 criminal trials
• b N(189,21)
• c 0.0432
• d 162.09


Exercise: 


Problem: 


According to the Internal Revenue Service, the average length of time 
for an individual to complete (record keep, learn, prepare, copy, 
assemble and send) IRS Form 1040 is 10.53 hours (without any 
attached schedules). The distribution is unknown. Let us assume that 
the standard deviation is 2 hours. Suppose we randomly sample 36 
taxpayers. 


• a In words, X =

• b In words, \(\bar{X}\) =

• c \(\bar{X}\) ~

• d Would you be surprised if the 36 taxpayers finished their Form
1040s in an average of more than 12 hours? Explain why or why
not in complete sentences.

• e Would you be surprised if one taxpayer finished his Form 1040
in more than 12 hours? In a complete sentence, explain why.


Exercise: 
Problem: 
Suppose that a category of world class runners is known to run a
marathon (26 miles) in an average of 145 minutes with a standard
deviation of 14 minutes. Consider 49 of the races.


Let \(\bar{X}\) = the average of the 49 races.


• a \(\bar{X}\) ~

• b Find the probability that the runner will average between 142
and 146 minutes in these 49 marathons.

• c Find the 80th percentile for the average of these 49 marathons.

• d Find the median of the average running times.


Solution: 
• a \(N\left(145, \frac{14}{\sqrt{49}}\right)\)
• b 0.6247
• c 146.68
• d 145 minutes


Exercise: 


Problem: 


The attention span of a two year-old is exponentially distributed with a 
mean of about 8 minutes. Suppose we randomly survey 60 two year- 
olds. 


• a In words, X =

• b X ~

• c In words, \(\bar{X}\) =

• d \(\bar{X}\) ~

• e Before doing any calculations, which do you think will be
higher? Explain why.


◦ i the probability that an individual attention span is less than
10 minutes; or

◦ ii the probability that the average attention span for the 60
children is less than 10 minutes? Why?


• f Calculate the probabilities in part (e).
• g Explain why the distribution for \(\bar{X}\) is not exponential.


Exercise: 


Problem: 


Suppose that the length of research papers is uniformly distributed 
from 10 to 25 pages. We survey a class in which 55 research papers 
were turned in to a professor. The 55 research papers are considered a 
random collection of all papers. We are interested in the average length 
of the research papers. 


• a In words, X =

• b X ~

• c \(\mu_X\) =

• d \(\sigma_X\) =

• e In words, \(\bar{X}\) =

• f \(\bar{X}\) ~

• g In words, \(\Sigma X\) =

• h \(\Sigma X\) ~

• i Without doing any calculations, do you think that it’s likely that
the professor will need to read a total of more than 1050 pages?
Why?

• j Calculate the probability that the professor will need to read a
total of more than 1050 pages.

• k Why is it so unlikely that the average length of the papers will
be less than 12 pages?


Solution: 


• b U(10,25)
• c 17.5
• d \(\sqrt{\frac{225}{12}} = 4.3301\)
• f N(17.5, 0.5839)
• h N(962.5, 32.11)
• j 0.0032


Exercise: 


Problem: 


The length of songs in a collector’s CD collection is uniformly 
distributed from 2 to 3.5 minutes. Suppose we randomly pick 5 CDs 
from the collection. There is a total of 43 songs on the 5 CDs. 


• a In words, X =

• b X ~

• c In words, \(\bar{X}\) =

• d \(\bar{X}\) ~

• e Find the first quartile for the average song length.

• f The IQR (interquartile range) for the average song length is
from ____ to ____


Exercise: 


Problem: 


Salaries for teachers in a particular elementary school district are 
normally distributed with a mean of $44,000 and a standard deviation 
of $6500. We randomly survey 10 teachers from that district. 


• a In words, X =

• b In words, \(\bar{X}\) =

• c \(\bar{X}\) ~

• d In words, \(\Sigma X\) =

• e \(\Sigma X\) ~

• f Find the probability that the teachers earn a total of over
$400,000.

• g Find the 90th percentile for an individual teacher’s salary.

• h Find the 90th percentile for the average teachers’ salary.

• i If we surveyed 70 teachers instead of 10, graphically, how
would that change the distribution for \(\bar{X}\)?

• j If each of the 70 teachers received a $3000 raise, graphically,
how would that change the distribution for \(\bar{X}\)?


Solution: 


• c \(N\left(44,000, \frac{6500}{\sqrt{10}}\right)\)
• e \(N\left(440,000, (\sqrt{10})(6500)\right)\)
• f 0.9742
• g $52,330
• h $46,634


Exercise: 


Problem: 


The distribution of income in some Third World countries is 
considered wedge shaped (many very poor people, very few middle 
income people, and few to many wealthy people). Suppose we pick a 
country with a wedge distribution. Let the average salary be $2000 per 
year with a standard deviation of $8000. We randomly survey 1000 
residents of that country. 


• a In words, X =

• b In words, \(\bar{X}\) =

• c \(\bar{X}\) ~

• d How is it possible for the standard deviation to be greater than
the average?

• e Why is it more likely that the average of the 1000 residents will
be from $2000 to $2100 than from $2100 to $2200?


Exercise: 
Problem: 
The average length of a maternity stay in a U.S. hospital is said to be
2.4 days with a standard deviation of 0.9 days. We randomly survey 80
women who recently bore children in a U.S. hospital.


• a In words, X =

• b In words, \(\bar{X}\) =

• c \(\bar{X}\) ~

• d In words, \(\Sigma X\) =

• e \(\Sigma X\) ~

• f Is it likely that an individual stayed more than 5 days in the
hospital? Why or why not?

• g Is it likely that the average stay for the 80 women was more
than 5 days? Why or why not?

• h Which is more likely:


◦ i an individual stayed more than 5 days; or
◦ ii the average stay of 80 women was more than 5 days?


• i If we were to sum up the women’s stays, is it likely that,
collectively they spent more than a year in the hospital? Why or
why not?


Solution: 
• c \(N\left(2.4, \frac{0.9}{\sqrt{80}}\right)\)
• e N(192, 8.05)
• h individual


Exercise: 


Problem: 


In 1940 the average size of a U.S. farm was 174 acres. Let’s say that 
the standard deviation was 55 acres. Suppose we randomly survey 38 
farmers from 1940. (Source: U.S. Dept. of Agriculture) 


• a In words, X =

• b In words, \(\bar{X}\) =

• c \(\bar{X}\) ~

• d The IQR for \(\bar{X}\) is from ____ acres to ____ acres.


Exercise: 


Problem: 


The stock closing prices of 35 U.S. semiconductor manufacturers are 
given below. (Source: Wall Street Journal) 


8.625 30.25 27.625 46.75 32.875 18.25 5 0.125 2.9375 6.875 28.25
24.25 21 1.5 30.25 71 43.5 49.25 2.5625 31 16.5 9.5 18.5 18 9 10.5
16.625 1.25 18 12.875 7 12.875 2.875 60.25 29.25


• a In words, X =

• b

◦ i \(\bar{x}\) =
◦ ii \(s_x\) =
◦ iii n =


• c Construct a histogram of the distribution. Start
at x = -0.0005. Make bar widths of 10.

• d In words, describe the distribution of stock prices.

• e Randomly average 5 stock prices together. (Use a random
number generator.) Continue averaging 5 pieces together until
you have 10 averages. List those 10 averages.

• f Use the 10 averages from (e) to calculate:


◦ i \(\bar{x}\) =
◦ ii \(s_{\bar{x}}\) =


• g Construct a histogram of the distribution of the averages. Start
at x = -0.0005. Make bar widths of 10.

• h Does this histogram look like the graph in (c)?

• i In 1 - 2 complete sentences, explain why the graphs either look
the same or look different.

• j Based upon the theory of the Central Limit Theorem, \(\bar{X}\) ~


Solution: 


• b $20.71; $17.31; 35
• d Exponential distribution, X ~ Exp(1/20.71)
• f $20.71; $11.14
• j \(N\left(20.71, \frac{17.31}{\sqrt{5}}\right)\)


Exercise: 


Problem: 


Use the Initial Public Offering data (see Table of Contents) to do this
problem.


• a In words, X =

• b

◦ i \(\mu_X\) =
◦ ii \(\sigma_X\) =
◦ iii n =


• c Construct a histogram of the distribution. Start at x = -0.50.
Make bar widths of $5.

• d In words, describe the distribution of stock prices.

• e Randomly average 5 stock prices together. (Use a random
number generator.) Continue averaging 5 pieces together until
you have 15 averages. List those 15 averages.

• f Use the 15 averages from (e) to calculate the following:


◦ i \(\bar{x}\) =
◦ ii \(s_{\bar{x}}\) =


• g Construct a histogram of the distribution of the averages. Start
at x = -0.50. Make bar widths of $5.

• h Does this histogram look like the graph in (c)? Explain any
differences.

• i In 1 - 2 complete sentences, explain why the graphs either look
the same or look different.

• j Based upon the theory of the Central Limit Theorem, \(\bar{X}\) ~


Try these multiple choice questions (Exercises 19 - 23).


The next two questions refer to the following information: The time to 
wait for a particular rural bus is distributed uniformly from 0 to 75 minutes. 
100 riders are randomly sampled to learn how long they waited. 

Exercise: 


Problem: 


The 90th percentile sample average wait time (in minutes) for a sample 
of 100 riders is: 


• A 315.0
• B 40.3
• C 38.5
• D 65.2


Solution: 


B 
Exercise: 


Problem: 


Would you be surprised, based upon numerical calculations, if the 
sample average wait time (in minutes) for 100 riders was less than 30 
minutes? 


• A Yes
• B No
• C There is not enough information.


Solution: 


A 


Exercise: 


Problem: 


Which of the following is NOT TRUE about the distribution for 
averages? 


• A The mean, median and mode are equal
• B The area under the curve is one
• C The curve never touches the x-axis
• D The curve is skewed to the right


Solution: 


D 


The next three questions refer to the following information: The cost of 
unleaded gasoline in the Bay Area once followed an unknown distribution 
with a mean of $4.59 and a standard deviation of $0.10. Sixteen gas stations 
from the Bay Area are randomly chosen. We are interested in the average 
cost of gasoline for the 16 gas stations. 

Exercise: 


Problem: 
The distribution to use for the average cost of gasoline for the 16 gas 
stations is 

• A \(X \sim N(4.59, 0.10)\)
• B \(\bar{X} \sim N\left(4.59, \frac{0.10}{\sqrt{16}}\right)\)
• C \(\bar{X} \sim N(4.59, \ldots)\)
• D \(\bar{X} \sim N(4.59, \ldots)\)


Solution: 


B 


Exercise: 
Problem: 


What is the probability that the average price for 16 gas stations is over 
$4.69? 


• A Almost zero
• B 0.1587
• C 0.0943
• D Unknown


Solution: 


A 
Exercise: 
Problem: 


Find the probability that the average price for 30 gas stations is less 
than $4.55. 


• A 0.6554
• B 0.3446
• C 0.0142
• D 0.9858
• E 0


Solution: 


C 


Exercise: 


Problem: 


For the Charter School Problem (Example 6) in Central Limit 
Theorem: Using the Central Limit Theorem, calculate the following 
using the normal approximation to the binomial. 


• A Find the probability that less than 100 favor a charter school for
grades K - 5.

• B Find the probability that 170 or more favor a charter school for
grades K - 5.

• C Find the probability that no more than 140 favor a charter
school for grades K - 5.

• D Find the probability that there are fewer than 130 that favor a
charter school for grades K - 5.

• E Find the probability that exactly 150 favor a charter school for
grades K - 5.


If you either have access to an appropriate calculator or computer 
software, try calculating these probabilities using the technology. Try 
also using the suggestion that is at the bottom of Central Limit 
Theorem: Using the Central Limit Theorem for finding a website 
that calculates binomial probabilities. 


Solution: 


• C 0.0162
• E 0.0268


Exercise: 


Problem: 


Four friends, Janice, Barbara, Kathy and Roberta, decided to carpool 
together to get to school. Each day the driver would be chosen by 
randomly selecting one of the four names. They carpool to school for 
96 days. Use the normal approximation to the binomial to calculate the 
following probabilities. Round the standard deviation to 4 decimal 
places. 


• A Find the probability that Janice is the driver at most 20 days.

• B Find the probability that Roberta is the driver more than 16
days.

• C Find the probability that Barbara drives exactly 24 of those 96
days.


If you either have access to an appropriate calculator or computer 
software, try calculating these probabilities using the technology. Try 
also using the suggestion that is at the bottom of Central Limit 
Theorem: Using the Central Limit Theorem for finding a website 
that calculates binomial probabilities. 


Solution: 


• A 0.2047
• B 0.9615
• C 0.0938


**Exercise 24 contributed by Roberta Bloom


Review 

Central Limit Theorem: Review is part of the collection col10555 written 
by Barbara Illowsky and Susan Dean. The module consists of review 
exercises. 


The next three questions refer to the following information: Richard’s 
Furniture Company delivers furniture from 10 A.M. to 2 P.M. continuously 
and uniformly. We are interested in how long (in hours) past the 10 A.M. 
start time that individuals wait for their delivery.

Exercise: 


Problem: X ~ 


• A U(0,4)
• B U(10,2)
• C Exp(2)
• D N(2,1)


Solution: 


A 


Exercise: 


Problem: The average wait time is: 


• A 1 hour
• B 2 hour
• C 2.5 hour
• D 4 hour


Solution: 


B 


Exercise: 


Problem: 


Suppose that it is now past noon on a delivery day. The probability that
a person must wait at least 1 1/2 more hours is:


• A 1/4
• B 1/2
• C 3/4
• D 3/8


Solution: 


A 


Exercise: 


Problem: Given: X~Exp(1/3).


• a Find P(x > 1)
• b Calculate the minimum value for the upper quartile.
• c Find P(x = 1/3)


Solution: 


• a 0.7165
• b 4.16
• c 0


Exercise: 
Problem: 
• 40% of full-time students took 4 years to graduate
• 30% of full-time students took 5 years to graduate
• 20% of full-time students took 6 years to graduate
• 10% of full-time students took 7 years to graduate


The expected time for full-time students to graduate is:


• A 4 years
• B 4.5 years
• C 5 years
• D 5.5 years


Solution: 


C 
Exercise: 


Problem: 


Which of the following distributions is described by the following 
example? 


Many people can run a short distance of under 2 miles, but as the 
distance increases, fewer people can run that far. 


• A Binomial
• B Uniform
• C Exponential
• D Normal


Solution:


C


Exercise: 


Problem: 


The length of time to brush one’s teeth is generally thought to be
exponentially distributed with a mean of 3/4 minutes. Find the
probability that a randomly selected person brushes his/her teeth less
than 3/4 minutes.


• A 0.5
• B 3/4
• C 0.43
• D 0.63


Solution: 


D 
Exercise: 


Problem: 
Which distribution accurately describes the following situation? 


The chance that a teenage boy regularly gives his mother a kiss 
goodnight (and he should!!) is about 20%. Fourteen teenage boys are 
randomly surveyed. 


X = the number of teenage boys that regularly give their mother a kiss
goodnight


• A B(14,0.20)
• B P(2.8)
• C N(2.8,2.24)
• D Exp(1/0.20)


Solution: 


A 
Exercise: 


Problem: 
Which distribution accurately describes the following situation? 


A 2008 report on technology use states that approximately 20 percent
of U.S. households have never sent an e-mail. (source:
http://www.webguild.org/2008/05/20-percent-of-americans-have-never-used-email.php)
Suppose that we select a random sample of fourteen U.S. households.


X = the number of households in a 2008 sample of 14 households that
have never sent an email


• A B(14,0.20)
• B P(2.8)
• C N(2.8,2.24)
• D Exp(1/0.20)


Solution: 


A 


**Exercise 9 contributed by Roberta Bloom 


Dice lab 
Using Minitab, students will explore the Central Limit Theorem.


Central Limit Theorem Lab 


Name: 


Student Learning Outcomes: 


The student will examine properties of the Central Limit Theorem. 


Theoretical Distribution: Tossing a Single Die 

1. Is the distribution continuous or discrete? 

2. In words, state the theoretical distribution for the tossing of a single die.
X ~

3. Calculate μ =

(indicate the formula)

4. Calculate σ =

(indicate the formula)


Collect the Data 


1. Using the random number generator in Minitab, simulate the tossing of a 
single die 360 times. 


Calc -> Random Data -> Integer. 


2. Construct a histogram using Minitab and include it with this lab. Be sure 
that you label the graph and axis. 


3. Calculate the following: (include the session window) 
\(\bar{x}\) = ____ s = ____ n = 1 (single die)


4. Draw a smooth curve through the tops of the bars of the histogram. Using 1
- 2 complete sentences, describe the general shape of the curve.


Collecting Averages of Pairs 


Repeat step #1 (of the section above titled "Collect the Data") with one 
exception. Instead of recording the value of a single die, record the average 
of two dice. Use Minitab and generate 360 rows with two columns. Then 
use the Calc -> Row Statistics and select mean. This will give you a column 
with the ‘average’ of the two tosses. 


1. Construct a histogram. Scale the axes using the same scaling and labeling 
for the axis as you did for the first graph and include it with this lab. (This 
is done by selecting both the original data column and the mean2 column to 
graph. Then select multiple graphs — separate graphs — same x-scale. This 
will redraw your first graph, however the computer needs that graph as a 
reference to draw the second.) 


2. Calculate the following:
$\bar{x}$ = ______   s = ______   n = 2 (a pair of dice)


3. Draw a smooth curve through the tops of the bars of the histogram. In 1
to 2 complete sentences, describe the general shape of the curve.


Collecting Averages of Groups of Five 


Repeat step #1 (of the section above titled "Collect the Data") with one 
exception. Instead of recording the value of a single die, record the average 
of five dice. Use Minitab and generate 360 rows with five columns. Then 
use the Calc -> Row Statistics and select mean. This will give you a column 
with the ‘average’ of the five tosses. 


1. Construct a histogram. Scale the axes using the same scaling and labeling 
for the axis as you did for the first graph and include it with this lab. (Again 
you need to graph both the first data set with this as multiple graphs.) 


2. Calculate the following:
$\bar{x}$ = ______   s = ______   n = 5 (5 dice)


3. Draw a smooth curve through the tops of the bars of the histogram. In 1
to 2 complete sentences, describe the general shape of the curve.


Collecting Averages of Groups of 20 


Repeat step #1 (of the section above titled "Collect the Data") with one 
exception. Instead of recording the value of a single die, record the average 
of twenty dice. Use Minitab and generate 360 rows with twenty columns. Then
use the Calc -> Row Statistics and select mean. This will give you a column 
with the ‘average’ of the twenty tosses. 


1. Construct a histogram. Scale the axes using the same scaling and labeling 
for the axis as you did for the first graph and include it with this lab. (Again 
you need to graph both the first data set with this as multiple graphs.) 


2. Calculate the following:
$\bar{x}$ = ______   s = ______   n = 20 (20 dice)


3. Draw a smooth curve through the tops of the bars of the histogram. In 1
to 2 complete sentences, describe the general shape of the curve.
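
For readers without Minitab, the four data-collection steps above can be
approximated with a short Python sketch. It is not part of the lab: the
function name `simulate_dice_means` and the seed value are illustrative
assumptions, and Python's pseudo-random generator stands in for
Calc -> Random Data -> Integer.

```python
import random
import statistics


def simulate_dice_means(n_dice, rows=360, seed=0):
    """Return `rows` averages, each the mean of `n_dice` simulated fair dice."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    return [
        statistics.mean([rng.randint(1, 6) for _ in range(n_dice)])
        for _ in range(rows)
    ]


if __name__ == "__main__":
    # One line of summary statistics per lab section: n = 1, 2, 5, 20.
    for n in (1, 2, 5, 20):
        means = simulate_dice_means(n)
        print(f"n = {n:2d}: x-bar = {statistics.mean(means):.3f}, "
              f"s = {statistics.stdev(means):.3f}")
```

The printed sample means and standard deviations play the role of the values
requested in step 2 of each section; histograms would still need a plotting
tool of your choice.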


Discussion Questions: 


1. As n (the number of dice being averaged) changed, how did the shape of 
the graph change? (Be sure you have the same x axis for all graphs or you 
will not be able to correctly answer this question.) Use 1 — 2 complete 
sentences to explain what happened. 


2. Looking at the values for the sample mean and standard deviation, 
explain what happens as n increases. Use 1 — 2 complete sentences to 
explain what happened. 


3. State the Central Limit Theorem. How do your simulations demonstrate
this theorem? Use 3 to 4 complete sentences.


Confidence Intervals 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Calculate and interpret confidence intervals for one population mean
and one population proportion.

• Interpret the student-t probability distribution as the sample size
changes.

• Discriminate between problems applying the normal and the student-t
distributions.


Introduction 


Suppose you are trying to determine the mean rent of a two-bedroom 
apartment in your town. You might look in the classified section of the 
newspaper, write down several rents listed, and average them together. You 
would have obtained a point estimate of the true mean. If you are trying to 
determine the percent of times you make a basket when shooting a 
basketball, you might count the number of shots you make and divide that 
by the number of shots you attempted. In this case, you would have 
obtained a point estimate for the true proportion. 


We use sample data to make generalizations about an unknown population. 
This part of statistics is called inferential statistics. The sample data help 
us to make an estimate of a population parameter. We realize that the 
point estimate is most likely not the exact value of the population 
parameter, but close to it. After calculating point estimates, we construct 
confidence intervals in which we believe the parameter lies. 


In this chapter, you will learn to construct and interpret confidence 
intervals. You will also learn a new distribution, the Student's-t, and how it 
is used with these intervals. Throughout the chapter, it is important to keep 
in mind that the confidence interval is a random variable. It is the parameter 
that is fixed. 


If you worked in the marketing department of an entertainment company,
you might be interested in the mean number of compact discs (CDs) a
consumer buys per month. If so, you could conduct a survey and calculate
the sample mean, $\bar{x}$, and the sample standard deviation, $s$. You
would use $\bar{x}$ to estimate the population mean and $s$ to estimate the
population standard deviation. The sample mean, $\bar{x}$, is the point
estimate for the population mean, $\mu$. The sample standard deviation,
$s$, is the point estimate for the population standard deviation, $\sigma$.


Each of $\bar{x}$ and $s$ is also called a statistic.


A confidence interval is another type of estimate but, instead of being just 
one number, it is an interval of numbers. The interval of numbers is a range 
of values calculated from a given set of sample data. The confidence 
interval is likely to include an unknown population parameter. 


Suppose for the CD example we do not know the population mean $\mu$ but we
do know that the population standard deviation is $\sigma = 1$ and our
sample size is 100. Then by the Central Limit Theorem, the standard
deviation for the sample mean is

$$\frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{100}} = 0.1.$$

The Empirical Rule, which applies to bell-shaped distributions, says that in
approximately 95% of the samples, the sample mean, $\bar{x}$, will be within
two standard deviations of the population mean $\mu$. For our CD example,
two standard deviations is (2)(0.1) = 0.2. The sample mean $\bar{x}$ is
likely to be within 0.2 units of $\mu$.


Because $\bar{x}$ is within 0.2 units of $\mu$, which is unknown, $\mu$ is
likely to be within 0.2 units of $\bar{x}$ in 95% of the samples. The
population mean $\mu$ is contained in an interval whose lower number is
calculated by taking the sample mean and subtracting two standard
deviations ((2)(0.1)) and whose upper number is calculated by taking the
sample mean and adding two standard deviations. In other words, $\mu$ is
between $\bar{x} - 0.2$ and $\bar{x} + 0.2$ in 95% of all the samples.


For the CD example, suppose that a sample produced a sample mean
$\bar{x} = 2$. Then the unknown population mean $\mu$ is between

$$\bar{x} - 0.2 = 2 - 0.2 = 1.8 \quad \text{and} \quad \bar{x} + 0.2 = 2 + 0.2 = 2.2$$


We say that we are 95% confident that the unknown population mean 
number of CDs is between 1.8 and 2.2. The 95% confidence interval is 
(1.8, 2.2). 
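
The arithmetic of the CD example can be checked with a few lines of Python.
This is only a sketch of the "two standard deviations" approximation used
in this introduction; the function name `approx_95_interval` is an
illustrative assumption, not something from the text.

```python
import math


def approx_95_interval(sample_mean, sigma, n):
    """Approximate 95% interval: sample mean plus or minus two standard errors."""
    standard_error = sigma / math.sqrt(n)   # standard deviation of the sample mean
    margin = 2 * standard_error             # Empirical Rule: about 95% within 2 SDs
    return sample_mean - margin, sample_mean + margin


# CD example: sample mean 2, sigma = 1, n = 100
lower, upper = approx_95_interval(sample_mean=2, sigma=1, n=100)
print(round(lower, 2), round(upper, 2))  # prints: 1.8 2.2
```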


The 95% confidence interval implies two possibilities. Either the interval
(1.8, 2.2) contains the true mean $\mu$ or our sample produced an $\bar{x}$
that is not within 0.2 units of the true mean $\mu$. The second possibility
happens for only 5% of all the samples (100% - 95%).


Remember that a confidence interval is created for an unknown population
parameter like the population mean, $\mu$. Confidence intervals for some
parameters have the form


(point estimate - margin of error, point estimate + margin of error) 


The margin of error depends on the confidence level or percentage of 
confidence. 


When you read newspapers and journals, some reports will use the phrase 
"margin of error." Other reports will not use that phrase, but include a 
confidence interval as the point estimate + or - the margin of error. These 
are two ways of expressing the same concept. 


Note: Although the text only covers symmetric confidence intervals, there 
are non-symmetric confidence intervals (for example, a confidence interval 
for the standard deviation). 


Optional Collaborative Classroom Activity 


Have your instructor record the number of meals each student in your class 
eats out in a week. Assume that the standard deviation is known to be 3 
meals. Construct an approximate 95% confidence interval for the true mean 
number of meals students eat out each week. 


1. Calculate the sample mean.
2. $\sigma = 3$ and $n$ = the number of students surveyed.
3. Construct the interval
$\left(\bar{x} - 2 \cdot \frac{\sigma}{\sqrt{n}},\ \bar{x} + 2 \cdot \frac{\sigma}{\sqrt{n}}\right)$

We say we are approximately 95% confident that the true average number
of meals that students eat out in a week is between ______ and ______.


Glossary 


Confidence Interval (CI) 
An interval estimate for an unknown population parameter. This 
depends on: 


• The desired confidence level.

• Information that is known about the distribution (for example,
known standard deviation).

• The sample and its size.


Inferential Statistics 
Also called statistical inference or inductive statistics. This facet of 
Statistics deals with estimating a population parameter based on a 
sample statistic. For example, if 4 out of the 100 calculators sampled 
are defective we might infer that 4 percent of the production is 
defective. 


Parameter 
A numerical characteristic of the population. 


Point Estimate 
A single number computed from a sample and used to estimate a 
population parameter. 


Confidence Interval, Single Population Mean, Population Standard 
Deviation Known, Normal 

Confidence Intervals: Confidence Interval, Single Population Mean, 
Population Standard Deviation Known, Normal is part of the collection 
col10555 written by Barbara Illowsky and Susan Dean with contributions 
from Roberta Bloom. 


Calculating the Confidence Interval 


To construct a confidence interval for a single unknown population mean
$\mu$, where the population standard deviation is known, we need $\bar{x}$
as an estimate for $\mu$ and we need the margin of error. Here, the margin
of error is called the error bound for a population mean (abbreviated EBM).
The sample mean $\bar{x}$ is the point estimate of the unknown population
mean $\mu$.

The confidence interval estimate will have the form:


• (point estimate - error bound, point estimate + error bound) or, in
symbols, $(\bar{x} - EBM,\ \bar{x} + EBM)$


The margin of error depends on the confidence level (abbreviated CL). The 
confidence level is often considered the probability that the calculated 
confidence interval estimate will contain the true population parameter. 
However, it is more accurate to state that the confidence level is the percent 
of confidence intervals that contain the true population parameter when 
repeated samples are taken. Most often, it is the choice of the person 
constructing the confidence interval to choose a confidence level of 90% or 
higher because that person wants to be reasonably certain of his or her 
conclusions. 


There is another probability called alpha ($\alpha$). $\alpha$ is related to
the confidence level CL. $\alpha$ is the probability that the interval does
not contain the unknown population parameter.

Mathematically, $\alpha + CL = 1$.


Example: 


• Suppose we have collected data from a sample. We know the sample
mean but we do not know the mean for the entire population.
• The sample mean is 7 and the error bound for the mean is 2.5.


$\bar{x} = 7$ and $EBM = 2.5$.

The confidence interval is (7 - 2.5, 7 + 2.5); calculating the values gives
(4.5, 9.5).

If the confidence level (CL) is 95%, then we say that "We estimate with
95% confidence that the true value of the population mean is between 4.5
and 9.5."


A confidence interval for a population mean with a known standard
deviation is based on the fact that the sample means follow an
approximately normal distribution. Suppose that our sample has a mean of
$\bar{x} = 10$ and we have constructed the 90% confidence interval (5, 15)
where EBM = 5.


To get a 90% confidence interval, we must include the central 90% of the
probability of the normal distribution. If we include the central 90%, we
leave out a total of $\alpha = 10\%$ in both tails, or 5% in each tail, of
the normal distribution.


[Figure: normal curve with confidence level CL = 0.90 in the center;
$\bar{x} = 10$, EBM = 5, $\bar{x} - EBM = 5$, $\bar{x} + EBM = 15$.]


$\mu$ is believed to be in the interval (5, 15) with 90% confidence.


To capture the central 90%, we must go out 1.645 "standard deviations" on
either side of the calculated sample mean. 1.645 is the z-score from a
Standard Normal probability distribution that puts an area of 0.90 in the
center, an area of 0.05 in the far left tail, and an area of 0.05 in the far
right tail.


It is important that the "standard deviation" used must be appropriate for
the parameter we are estimating. So in this section, we need to use the
standard deviation that applies to sample means, which is
$\frac{\sigma}{\sqrt{n}}$. The quantity $\frac{\sigma}{\sqrt{n}}$ is
commonly called the "standard error of the mean" in order to clearly
distinguish the standard deviation for a mean from the population standard
deviation $\sigma$.


In summary, as a result of the Central Limit Theorem: 


• $\bar{X}$ is normally distributed, that is,
$\bar{X} \sim N\left(\mu_X, \frac{\sigma_X}{\sqrt{n}}\right)$


• When the population standard deviation $\sigma$ is known, we use a
Normal distribution to calculate the error bound.


Calculating the Confidence Interval: 

To construct a confidence interval estimate for an unknown population 
mean, we need data from a random sample. The steps to construct and 
interpret the confidence interval are: 


• Calculate the sample mean $\bar{x}$ from the sample data. Remember, in
this section, we already know the population standard deviation $\sigma$.

• Find the z-score that corresponds to the confidence level.

• Calculate the error bound EBM.

• Construct the confidence interval.

• Write a sentence that interprets the estimate in the context of the
situation in the problem. (Explain what the confidence interval means,
in the words of the problem.)


We will first examine each step in more detail, and then illustrate the 
process with some examples. 


Finding z for the stated Confidence Level 

When we know the population standard deviation $\sigma$, we use a standard
normal distribution to calculate the error bound EBM and construct the
confidence interval. We need to find the value of z that puts an area equal
to the confidence level (in decimal form) in the middle of the standard
normal distribution $Z \sim N(0,1)$.


The confidence level, CL, is the area in the middle of the standard normal
distribution. $CL = 1 - \alpha$. So $\alpha$ is the area that is split
equally between the two tails. Each of the tails contains an area equal to
$\frac{\alpha}{2}$.


The z-score that has an area to the right of $\frac{\alpha}{2}$ is denoted
by $z_{\alpha/2}$.


For example, when CL = 0.95 then $\alpha = 0.05$ and
$\frac{\alpha}{2} = 0.025$; we write $z_{\alpha/2} = z_{0.025}$.


The area to the right of $z_{0.025}$ is 0.025 and the area to the left of
$z_{0.025}$ is 1 - 0.025 = 0.975.


$z_{\alpha/2} = z_{0.025} = 1.96$, using a calculator, computer or a
Standard Normal probability table.


Using the TI83, TI83+ or TI84+ calculator: invNorm(0.975, 0, 1) = 1.96


CALCULATOR NOTE: Remember to use area to the LEFT of $z_{\alpha/2}$; in
this chapter the last two inputs in the invNorm command are 0,1 because you
are using a Standard Normal Distribution $Z \sim N(0,1)$.
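
The same lookup can be sketched in Python as a stand-in for invNorm. The
helper name `z_critical` is an illustrative assumption;
`statistics.NormalDist` is part of the Python standard library (3.8 and
later).

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """z-score with area alpha/2 to its right under Z ~ N(0, 1)."""
    alpha = 1 - confidence_level
    area_to_left = 1 - alpha / 2          # invNorm also expects the area to the LEFT
    return NormalDist(mu=0, sigma=1).inv_cdf(area_to_left)


print(round(z_critical(0.95), 2))   # prints: 1.96
print(round(z_critical(0.90), 3))   # prints: 1.645
```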


EBM: Error Bound
The error bound formula for an unknown population mean $\mu$ when the
population standard deviation $\sigma$ is known is


• $EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$


Constructing the Confidence Interval 


• The confidence interval estimate has the format
$(\bar{x} - EBM,\ \bar{x} + EBM)$.


The graph gives a picture of the entire situation. 


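$CL + \frac{\alpha}{2} + \frac{\alpha}{2} = CL + \alpha = 1$.


[Figure: normal curve centered at $\bar{x}$, with area $CL = 1 - \alpha$
between $\bar{x} - EBM$ and $\bar{x} + EBM$ and area $\frac{\alpha}{2}$ in
each tail.]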


Writing the Interpretation 

The interpretation should clearly state the confidence level (CL), explain
what population parameter is being estimated (here, a population mean),
and should state the confidence interval (both endpoints). "We estimate with
___% confidence that the true population mean (include context of the
problem) is between _____ and _____ (include appropriate units)."


Example: 

Suppose scores on exams in statistics are normally distributed with an 
unknown population mean and a population standard deviation of 3 points. 
A random sample of 36 scores is taken and gives a sample mean (sample 
mean score) of 68. Find a confidence interval estimate for the population 
mean exam score (the mean score on all exams). 

Exercise: 


Problem: 


Find a 90% confidence interval for the true (population) mean of 
Statistics exam scores. 


Solution: 


• You can use technology to directly calculate the confidence
interval.

• The first solution is shown step-by-step (Solution A).

• The second solution uses the TI-83, 83+ and 84+ calculators
(Solution B).


Solution A 
To find the confidence interval, you need the sample mean, $\bar{x}$, and
the EBM.


• $\sigma = 3$; $n = 36$; the confidence level is 90% (CL = 0.90)


CL = 0.90 so $\alpha = 1 - CL = 1 - 0.90 = 0.10$


$\frac{\alpha}{2} = 0.05$, so $z_{\alpha/2} = z_{0.05}$


The area to the right of $z_{0.05}$ is 0.05 and the area to the left of
$z_{0.05}$ is 1 - 0.05 = 0.95.


$z_{\alpha/2} = z_{0.05} = 1.645$


using invNorm(0.95,0,1) on the TI-83, 83+, 84+ calculators. This can
also be found using appropriate commands on other calculators, using
a computer, or using a probability table for the Standard Normal
distribution.


$EBM = 1.645 \cdot \left(\frac{3}{\sqrt{36}}\right) = 0.8225$


$\bar{x} - EBM = 68 - 0.8225 = 67.1775$
$\bar{x} + EBM = 68 + 0.8225 = 68.8225$


The 90% confidence interval is (67.1775, 68.8225).

Solution B


Using a function of the TI-83, TI-83+ or TI-84 calculators:


Press STAT and arrow over to TESTS.

Arrow down to 7:ZInterval.

Press ENTER.

Arrow to Stats and press ENTER.

Arrow down and enter 3 for $\sigma$, 68 for $\bar{x}$, 36 for n, and .90
for C-Level.

Arrow down to Calculate and press ENTER.

The confidence interval is (to 3 decimal places) (67.178, 68.822).

Interpretation

We estimate with 90% confidence that the true population mean exam
score for all statistics students is between 67.18 and 68.82.

Explanation of 90% Confidence Level

90% of all confidence intervals constructed in this way contain the true
mean statistics exam score. For example, if we constructed 100 of these
confidence intervals, we would expect 90 of them to contain the true
population mean exam score.
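
As a cross-check on both solutions, the whole calculation can be sketched in
Python. The function name `z_interval` is an illustrative assumption (it is
not the calculator's ZInterval routine), and the z-score is left unrounded,
which is presumably why the result agrees with Solution B rather than with
the rounded 1.645 used in Solution A.

```python
import math
from statistics import NormalDist


def z_interval(sample_mean, sigma, n, confidence_level):
    """Confidence interval for a mean when the population sigma is known."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    ebm = z * sigma / math.sqrt(n)            # error bound for the mean
    return sample_mean - ebm, sample_mean + ebm


# Exam-score example: x-bar = 68, sigma = 3, n = 36, CL = 0.90
lower, upper = z_interval(sample_mean=68, sigma=3, n=36, confidence_level=0.90)
print(f"({lower:.3f}, {upper:.3f})")  # prints: (67.178, 68.822)
```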


Changing the Confidence Level or Sample Size 


Example: Changing the Confidence Level
Exercise: 


Problem: 

Suppose we change the original problem by using a 95% confidence 
level. Find a 95% confidence interval for the true (population) mean 
Statistics exam score. 


Solution: 


To find the confidence interval, you need the sample mean, $\bar{x}$, and
the EBM.


• $\sigma = 3$; $n = 36$; the confidence level is 95% (CL = 0.95)

CL = 0.95 so $\alpha = 1 - CL = 1 - 0.95 = 0.05$


$\frac{\alpha}{2} = 0.025$, so $z_{\alpha/2} = z_{0.025}$

The area to the right of $z_{0.025}$ is 0.025 and the area to the left of
$z_{0.025}$ is 1 - 0.025 = 0.975.


$z_{\alpha/2} = z_{0.025} = 1.96$


using invNorm(.975,0,1) on the TI-83, 83+, 84+ calculators. (This can
also be found using appropriate commands on other calculators, using
a computer, or using a probability table for the Standard Normal
distribution.)


$EBM = 1.96 \cdot \left(\frac{3}{\sqrt{36}}\right) = 0.98$


$\bar{x} - EBM = 68 - 0.98 = 67.02$


$\bar{x} + EBM = 68 + 0.98 = 68.98$


Interpretation

We estimate with 95% confidence that the true population mean for all
Statistics exam scores is between 67.02 and 68.98.

Explanation of 95% Confidence Level 

95% of all confidence intervals constructed in this way contain the true 
value of the population mean statistics exam score. 

Comparing the results 

The 90% confidence interval is (67.18, 68.82). The 95% confidence 
interval is (67.02, 68.98). The 95% confidence interval is wider. If you 
look at the graphs, because the area 0.95 is larger than the area 0.90, it 
makes sense that the 95% confidence interval is wider. 


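[Figure: two normal curves, one with area 0.95 in the center and 0.025 in
each tail, the other with area 0.90 in the center and 0.05 in each tail.]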


Summary: Effect of Changing the Confidence Level 


• Increasing the confidence level increases the error bound, making the
confidence interval wider.

• Decreasing the confidence level decreases the error bound, making
the confidence interval narrower.


Example: Changing the Sample Size

Suppose we change the original problem to see what happens to the error 
bound if the sample size is changed. 

Exercise: 


Problem: 


Leave everything the same except the sample size. Use the original
90% confidence level. What happens to the error bound and the
confidence interval if we increase the sample size and use n = 100
instead of n = 36? What happens if we decrease the sample size to
n = 25 instead of n = 36?


• $\bar{x} = 68$

• $EBM = z_{\alpha/2} \cdot \left(\frac{\sigma}{\sqrt{n}}\right)$

• $\sigma = 3$; the confidence level is 90% (CL = 0.90);
$z_{\alpha/2} = z_{0.05} = 1.645$


Solution: 


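If we increase the sample size n to 100, we decrease the error bound.


When n = 100:
$EBM = z_{\alpha/2} \cdot \left(\frac{\sigma}{\sqrt{n}}\right) = 1.645 \cdot \left(\frac{3}{\sqrt{100}}\right) = 0.4935$


Solution:


If we decrease the sample size n to 25, we increase the error bound.


When n = 25:
$EBM = z_{\alpha/2} \cdot \left(\frac{\sigma}{\sqrt{n}}\right) = 1.645 \cdot \left(\frac{3}{\sqrt{25}}\right) = 0.987$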


Summary: Effect of Changing the Sample Size 


• Increasing the sample size causes the error bound to decrease, making
the confidence interval narrower.

• Decreasing the sample size causes the error bound to increase, making
the confidence interval wider.
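
Both summaries in this section (the effect of the confidence level and the
effect of the sample size) can be checked numerically. The sketch below
reuses the exam-score values from the examples; the helper name
`error_bound` is an illustrative assumption, and the z-scores are not
rounded, so the last digit can differ slightly from the hand calculations.

```python
import math
from statistics import NormalDist


def error_bound(sigma, n, confidence_level):
    """EBM = z_{alpha/2} * sigma / sqrt(n) for a known population sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return z * sigma / math.sqrt(n)


SIGMA = 3  # population standard deviation from the exam-score example

# Same confidence level (90%), different sample sizes.
by_sample_size = {n: error_bound(SIGMA, n, 0.90) for n in (25, 36, 100)}

# Same sample size (n = 36), different confidence levels.
by_confidence = {cl: error_bound(SIGMA, 36, cl) for cl in (0.90, 0.95)}

for n, ebm in by_sample_size.items():
    print(f"CL = 0.90, n = {n:3d}: EBM = {ebm:.4f}")
for cl, ebm in by_confidence.items():
    print(f"n = 36, CL = {cl:.2f}: EBM = {ebm:.4f}")
```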


Working Backwards to Find the Error Bound or Sample Mean 


When we calculate a confidence interval, we find the sample mean and 
calculate the error bound and use them to calculate the confidence interval. 
But sometimes when we read statistical studies, the study may state the 
confidence interval only. If we know the confidence interval, we can work 
backwards to find both the error bound and the sample mean. 

Finding the Error Bound


• From the upper value for the interval, subtract the sample mean.
• OR, from the upper value for the interval, subtract the lower value.
Then divide the difference by 2.


Finding the Sample Mean


• Subtract the error bound from the upper value of the confidence
interval.
• OR, average the upper and lower endpoints of the confidence interval.


Notice that there are two methods to perform each calculation. You can 
choose the method that is easier to use with the information you know. 


Example: 

Suppose we know that a confidence interval is (67.18, 68.82) and we want 
to find the error bound. We may know that the sample mean is 68. Or 
perhaps our source only gave the confidence interval and did not tell us the
value of the sample mean.

Calculate the Error Bound: 


• If we know that the sample mean is 68: EBM = 68.82 - 68 = 0.82


• If we don't know the sample mean:
$EBM = \frac{68.82 - 67.18}{2} = 0.82$


Calculate the Sample Mean:


• If we know the error bound: $\bar{x} = 68.82 - 0.82 = 68$


• If we don't know the error bound:
$\bar{x} = \frac{67.18 + 68.82}{2} = 68$
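
The "endpoints only" versions of both calculations can be sketched in
Python. The function name `recover_from_interval` is an illustrative
assumption; it uses only the two endpoints of a symmetric interval.

```python
def recover_from_interval(lower, upper):
    """Return (sample mean, error bound) of a symmetric confidence interval."""
    sample_mean = (lower + upper) / 2   # midpoint of the interval
    error_bound = (upper - lower) / 2   # half the width of the interval
    return sample_mean, error_bound


sample_mean, error_bound = recover_from_interval(67.18, 68.82)
print(round(sample_mean, 2), round(error_bound, 2))
```

Up to floating-point rounding, this should reproduce the sample mean of 68
and the error bound of 0.82 from the example.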


Calculating the Sample Size n 


If researchers desire a specific margin of error, then they can use the error 
bound formula to calculate the required sample size. 


The error bound formula for a population mean when the population
standard deviation is known is
$EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$.

The formula for sample size is $n = \frac{z^2 \sigma^2}{EBM^2}$, found by
solving the error bound formula for n.


In this formula, $z$ is $z_{\alpha/2}$, corresponding to the desired
confidence level. A researcher planning a study who wants a specified
confidence level and error bound can use this formula to calculate the size
of the sample needed for the study.
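
For readers who want the intermediate algebra, solving the error bound
formula for n can be written out as follows (a routine rearrangement, with
$z$ standing for $z_{\alpha/2}$ as in the text):

```latex
\begin{align*}
EBM &= z \cdot \frac{\sigma}{\sqrt{n}} \\
\sqrt{n} &= \frac{z \, \sigma}{EBM} \\
n &= \frac{z^{2} \sigma^{2}}{EBM^{2}}
\end{align*}
```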


Example: 

The population standard deviation for the age of Foothill College students
is 15 years. If we want to be 95% confident that the sample mean age is
within 2 years of the true population mean age of Foothill College
students, how many randomly selected Foothill College students must be
surveyed?


• From the problem, we know that $\sigma = 15$ and EBM = 2.

• $z = z_{0.025} = 1.96$, because the confidence level is 95%.

• $n = \frac{z^2 \sigma^2}{EBM^2} = \frac{(1.96)^2 (15)^2}{2^2} = 216.09$
using the sample size equation.

• Use n = 217: Always round the answer UP to the next higher integer
to ensure that the sample size is large enough.


Therefore, 217 Foothill College students should be surveyed in order to be
95% confident that we are within 2 years of the true population mean age
of Foothill College students.
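
The sample size calculation, including the round-up step, can be sketched in
Python. The function name `required_sample_size` is an illustrative
assumption, and the z-score is computed rather than rounded to 1.96, which
does not change the rounded-up answer here.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, error_bound, confidence_level):
    """Smallest n whose error bound is at most `error_bound` (sigma known)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n_exact = (z ** 2) * (sigma ** 2) / (error_bound ** 2)
    return math.ceil(n_exact)   # always round UP to the next integer


# Foothill College example: sigma = 15 years, EBM = 2 years, CL = 95%
n = required_sample_size(sigma=15, error_bound=2, confidence_level=0.95)
print(n)  # prints: 217
```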


**With contributions from Roberta Bloom 


Glossary


Confidence Interval (CI)
An interval estimate for an unknown population parameter. This
depends on:


• The desired confidence level.

• Information that is known about the distribution (for example,
known standard deviation).

• The sample and its size.


Confidence Level (CL)
The percent expression for the probability that the confidence interval
contains the true population parameter. For example, if the CL = 90%,
then in 90 out of 100 samples the interval estimate will enclose the
true population parameter.


Error Bound for a Population Mean (EBM) 
The margin of error. Depends on the confidence level, sample size, and 
known or estimated population standard deviation. 


Confidence Interval, Single Population Mean, Standard Deviation 
Unknown, Student-T 

Confidence Interval, Single Population Mean, Population Standard 
Deviation Unknown, Student-t is part of the collection col10555 written by 
Barbara Illowsky and Susan Dean with contributions from Roberta Bloom. 


In practice, we rarely know the population standard deviation. In the past, 
when the sample size was large, this did not present a problem to 
statisticians. They used the sample standard deviation $s$ as an estimate for $\sigma$
and proceeded as before to calculate a confidence interval with close 
enough results. However, statisticians ran into problems when the sample 
size was small. A small sample size caused inaccuracies in the confidence 
interval. 


William S. Gosset (1876-1937) of the Guinness brewery in Dublin, Ireland
ran into this problem. His experiments with hops and barley produced very
few samples. Just replacing $\sigma$ with $s$ did not produce accurate
results when he tried to calculate a confidence interval. He realized that
he could not use a normal distribution for the calculation; he found that
the actual distribution depends on the sample size. This problem led him to
"discover" what is called the Student's-t distribution. The name comes from
the fact that Gosset wrote under the pen name "Student."


Up until the mid 1970s, some statisticians used the normal distribution 
approximation for large sample sizes and only used the Student's-t 
distribution for sample sizes of at most 30. With the common use of 
graphing calculators and computers, the practice is to use the Student's-t 
distribution whenever $s$ is used as an estimate for $\sigma$.


If you draw a simple random sample of size n from a population that has
approximately a normal distribution with mean $\mu$ and unknown population
standard deviation $\sigma$ and calculate the t-score
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, then the t-scores follow a
Student's-t distribution with $n - 1$ degrees of freedom. The t-score has
the same interpretation as the z-score. It measures how far $\bar{x}$ is
from its mean $\mu$. For each sample size n, there is a different
Student's-t distribution.


The degrees of freedom, $n - 1$, come from the calculation of the sample
standard deviation $s$. In Chapter 2, we used n deviations ($x - \bar{x}$
values) to calculate $s$. Because the sum of the deviations is 0, we can
find the last deviation once we know the other $n - 1$ deviations. The other
$n - 1$ deviations can change or vary freely. We call the number $n - 1$
the degrees of freedom (df).
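
A tiny numerical illustration of this idea, using a made-up sample (the
four data values below are hypothetical and chosen only for illustration):

```python
from statistics import mean

data = [4, 7, 9, 12]               # hypothetical sample with n = 4
x_bar = mean(data)
deviations = [x - x_bar for x in data]

# The deviations sum to 0, so the last one is determined by the other n - 1.
known = deviations[:-1]            # the n - 1 deviations that vary freely
recovered_last = -sum(known)

print(deviations, recovered_last)
```

Here only three of the four deviations are free; the fourth is forced to be
whatever makes the total zero, which is why df = n - 1 = 3 for this sample.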

Properties of the Student's-t Distribution 


• The graph for the Student's-t distribution is similar to the Standard
Normal curve.

• The mean for the Student's-t distribution is 0 and the distribution is
symmetric about 0.

• The Student's-t distribution has more probability in its tails than the
Standard Normal distribution because the spread of the t distribution is
greater than the spread of the Standard Normal. So the graph of the
Student's-t distribution will be thicker in the tails and shorter in the
center than the graph of the Standard Normal distribution.

• The exact shape of the Student's-t distribution depends on the "degrees
of freedom". As the degrees of freedom increase, the graph of the
Student's-t distribution becomes more like the graph of the Standard
Normal distribution.

• The underlying population of individual observations is assumed to be
normally distributed with unknown population mean $\mu$ and unknown
population standard deviation $\sigma$. The size of the underlying
population is generally not relevant unless it is very small. If it is bell
shaped (normal) then the assumption is met and doesn't need discussion.
Random sampling is assumed but it is a completely separate
assumption from normality.


Calculators and computers can easily calculate any Student's-t probabilities.
The TI-83, 83+, 84+ have a tcdf function to find the probability for given
values of t. The syntax for the tcdf command is tcdf(lower bound, upper
bound, degrees of freedom). However, for confidence intervals, we need to
use inverse probability to find the value of t when we know the probability.


For the TI-84+ you can use the invT command on the DISTRibution menu.
The invT command works similarly to invNorm. The invT command
requires two inputs: invT(area to the left, degrees of freedom). The output
is the t-score that corresponds to the area we specified.


The TI-83 and 83+ do not have the invT command. (The TI-89 has an
inverse T command.)
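
For readers working without one of these calculators, an inverse-t lookup
can be approximated with the Python standard library alone. The sketch below
is not how any calculator computes invT; it is one simple numerical approach
(Simpson's rule for the area, bisection for the inverse), and the names
`t_pdf`, `t_cdf` and `inv_t` are illustrative assumptions. Statistical
libraries such as SciPy (`scipy.stats.t.ppf`) provide the same quantity
directly.

```python
import math


def t_pdf(t, df):
    """Density of the Student's-t distribution with `df` degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))


def t_cdf(t, df, steps=2000):
    """Area to the left of t, using Simpson's rule and the symmetry about 0."""
    width = abs(t) / steps                      # steps must be even
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * width, df)
    half_area = total * width / 3               # area between 0 and |t|
    return 0.5 + half_area if t >= 0 else 0.5 - half_area


def inv_t(area_to_left, df):
    """t-score with the given area to its left (bisection on t_cdf)."""
    low, high = -100.0, 100.0                   # search bracket, an assumption
    for _ in range(60):
        mid = (low + high) / 2
        if t_cdf(mid, df) < area_to_left:
            low = mid
        else:
            high = mid
    return (low + high) / 2


print(round(inv_t(0.975, 14), 2))  # prints: 2.14
```

The arguments mirror invT(area to the left, degrees of freedom), so
`inv_t(0.975, 14)` corresponds to invT(.975,14).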


A probability table for the Student's-t distribution can also be used. The 
table gives t-scores that correspond to the confidence level (column) and 
degrees of freedom (row). (The TI-86 does not have an invT program or 
command, so if you are using that calculator, you need to use a probability 
table for the Student's-t distribution.) When using a t-table, note that some
tables are formatted to show the confidence level in the column headings, 
while the column headings in some tables may show only corresponding 
area in one or both tails. 


A Student's-t table (See the Table of Contents 15. Tables) gives t-scores 
given the degrees of freedom and the right-tailed probability. The table is 
very limited. Calculators and computers can easily calculate any 
Student's-t probabilities. 

The notation for the Student's-t distribution (using T as the random
variable) is


• $T \sim t_{df}$ where $df = n - 1$.

• For example, if we have a sample of size n = 20 items, then we calculate
the degrees of freedom as df = n - 1 = 20 - 1 = 19 and we write the
distribution as $T \sim t_{19}$.


If the population standard deviation is not known, the error bound for
a population mean is:


• $EBM = t_{\alpha/2} \cdot \left(\frac{s}{\sqrt{n}}\right)$
• $t_{\alpha/2}$ is the t-score with area to the right equal to
$\frac{\alpha}{2}$
• use $df = n - 1$ degrees of freedom
• $s$ = sample standard deviation


The format for the confidence interval is:


$(\bar{x} - EBM,\ \bar{x} + EBM)$.


The TI-83, 83+ and 84 calculators have a function that calculates the 
confidence interval directly. To get to it, 

Press STAT 

Arrow over to TESTS. 

Arrow down to 8: TInterval and press ENTER (or just press 8). 


Example: 
Exercise: 


Problem: 


Suppose you do a study of acupuncture to determine how effective it
is in relieving pain. You measure sensory rates for 15 subjects with
the results given below. Use the sample data to construct a 95%
confidence interval for the mean sensory rate for the population
(assumed normal) from which you took the data.


8.6 9.4 7.9 6.8 8.3 7.3 9.2 9.6 8.7 11.4 10.3 5.4 8.1 5.5 6.9


The solution is shown step-by-step and by using the TI-83, 83+ and
84+ calculators.


Solution: 


• You can use technology to directly calculate the confidence
interval.

• The first solution is step-by-step (Solution A).

• The second solution uses the TI-83+ and TI-84 calculators
(Solution B).


Solution A 
To find the confidence interval, you need the sample mean, $\bar{x}$, and
the EBM.


$\bar{x} = 8.2267$; $s = 1.6722$; $n = 15$


$df = 15 - 1 = 14$
CL = 0.95 so $\alpha = 1 - CL = 1 - 0.95 = 0.05$
$\frac{\alpha}{2} = 0.025$, so $t_{\alpha/2} = t_{0.025}$


The area to the right of $t_{0.025}$ is 0.025 and the area to the left of
$t_{0.025}$ is 1 - 0.025 = 0.975.


$t_{\alpha/2} = t_{0.025} = 2.14$ using invT(.975,14) on the TI-84+
calculator.


$EBM = t_{\alpha/2} \cdot \left(\frac{s}{\sqrt{n}}\right) = 2.14 \cdot \left(\frac{1.6722}{\sqrt{15}}\right) = 0.924$


$\bar{x} - EBM = 8.2267 - 0.9240 = 7.3$
$\bar{x} + EBM = 8.2267 + 0.9240 = 9.15$
The 95% confidence interval is (7.30, 9.15).


We estimate with 95% confidence that the true population mean
sensory rate is between 7.30 and 9.15.


Solution B 
Using a function of the TI-83, TI-83+ or TI-84 calculators: 


Press STAT and arrow over to TESTS. 

Arrow down to 8: TInterval and press ENTER (or you can just press 
8). Arrow to Data and press ENTER. 

Arrow down to List and enter the list name where you put the data. 
Arrow down to Freg and enter 1. 

Arrow down to C- Level and enter .95 

Arrow down to Calculate and press ENTER. 

The 95% confidence interval is (7.3006, 9.1527) 
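For readers working without a TI calculator, the step-by-step calculation can also be sketched in Python. This is only an illustration: the function name t_interval is made up for this sketch, and the critical value 2.14 is typed in by hand from Solution A because the Python standard library has no inverse Student's-t function.

```python
import math
import statistics

# Sensory rates for the 15 subjects, copied from the problem statement.
rates = [8.6, 9.4, 7.9, 6.8, 8.3, 7.3, 9.2, 9.6, 8.7, 11.4, 10.3, 5.4, 8.1, 5.5, 6.9]


def t_interval(data, t_crit):
    """Return (lower, upper, ebm) for a mean, given a t critical value."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    ebm = t_crit * s / math.sqrt(n)
    return x_bar - ebm, x_bar + ebm, ebm


# Assumption: t_{0.025} with df = 14 is taken as 2.14, the rounded value from Solution A.
lower, upper, ebm = t_interval(rates, t_crit=2.14)
print(f"EBM = {ebm:.3f}, interval = ({lower:.2f}, {upper:.2f})")
```

Because the critical value is rounded to 2.14, the endpoints agree with Solution A to two decimal places and differ slightly from the calculator output in Solution B.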


Note: When calculating the error bound, a probability table for the 
Student's-t distribution can also be used to find the value of t. The table 
gives t-scores that correspond to the confidence level (column) and 
degrees of freedom (row); the t-score is found where the row and column 
intersect in the table. 


** With contributions from Roberta Bloom 


Glossary 


Confidence Interval (CI) 
An interval estimate for an unknown population parameter. This 
depends on: 


• The desired confidence level.

• Information that is known about the distribution (for example,
known standard deviation).

• The sample and its size.


Confidence Level (CL) 
The percent expression for the probability that the confidence interval 
contains the true population parameter. For example, if the CL = 90%,
then in 90 out of 100 samples the interval estimate will enclose the
true population parameter. 


Degrees of Freedom (df) 
The number of objects in a sample that are free to vary. 


Error Bound for a Population Mean (EBM) 
The margin of error. Depends on the confidence level, sample size, and 
known or estimated population standard deviation. 


Normal Distribution 


A continuous random variable (RV) with pdf 


$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2 / (2\sigma^2)}$, where $\mu$ is the mean of the distribution and
$\sigma$ is the standard deviation. Notation: $X \sim N(\mu, \sigma)$. If $\mu = 0$ and
$\sigma = 1$, the RV is called the standard normal distribution.


Standard Deviation 
A number that is equal to the square root of the variance and measures 
how far data values are from their mean. Notation: s for sample 
standard deviation and o for population standard deviation. 


Student's-t Distribution 
Investigated and reported by William S. Gosset in 1908 and published
under the pseudonym Student. The major characteristics of the random 
variable (RV) are: 


• It is continuous and assumes any real values.

• The pdf is symmetrical about its mean of zero. However, it is
more spread out and flatter at the apex than the normal
distribution.

• It approaches the standard normal distribution as n gets larger.

• There is a "family" of t distributions: every representative of the
family is completely defined by the number of degrees of
freedom, which is one less than the number of data values.


Confidence Interval for a Population Proportion 

Confidence Interval for a Population Proportion is part of the collection 
col10555 written by Barbara Illowsky and Susan Dean with contributions 
from Roberta Bloom. 


During an election year, we see articles in the newspaper that state 
confidence intervals in terms of proportions or percentages. For example, a 
poll for a particular candidate running for president might show that the 
candidate has 40% of the vote within 3 percentage points. Often, election 
polls are calculated with 95% confidence. So, the pollsters would be 95% 
confident that the true proportion of voters who favored the candidate 
would be between 0.37 and 0.43: (0.40 − 0.03, 0.40 + 0.03).


Investors in the stock market are interested in the true proportion of stocks 
that go up and down each week. Businesses that sell personal computers are 
interested in the proportion of households in the United States that own 
personal computers. Confidence intervals can be calculated for the true 
proportion of stocks that go up or down each week and for the true 
proportion of households in the United States that own personal computers. 


The procedure to find the confidence interval, the sample size, the error 
bound, and the confidence level for a proportion is similar to that for the 
population mean. The formulas are different. 


How do you know you are dealing with a proportion problem? First, the 
underlying distribution is binomial. (There is no mention of a mean or 
average.) If X is a binomial random variable, then X~B(n, p) where n = 
the number of trials and p = the probability of a success. To form a 
proportion, take X, the random variable for the number of successes and 
divide it by n, the number of trials (or the sample size). The random 
variable P′ (read "P prime") is that proportion,


$P' = \frac{X}{n}$


(Sometimes the random variable is denoted as $\hat{P}$, read "P hat".)


When n is large and p is not close to 0 or 1, we can use the normal 
distribution to approximate the binomial. 


$X \sim N\left(np, \sqrt{npq}\right)$


If we divide the random variable by n, the mean by n, and the standard
deviation by n, we get a normal distribution of proportions with P′, called
the estimated proportion, as the random variable. (Recall that a proportion =
the number of successes divided by n.)


$\frac{X}{n} = P' \sim N\left(\frac{np}{n}, \frac{\sqrt{npq}}{n}\right)$


Using algebra to simplify: $\frac{\sqrt{npq}}{n} = \sqrt{\frac{pq}{n}}$


P′ follows a normal distribution for proportions: $P' \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$


The confidence interval has the form (p′ − EBP, p′ + EBP).


$p' = \frac{x}{n}$


p′ = the estimated proportion of successes (p′ is a point estimate for p,
the true proportion)


x = the number of successes.
n = the size of the sample


The error bound for a proportion is


$EBP = z_{\alpha/2} \cdot \sqrt{\frac{p'q'}{n}}$ where $q' = 1 - p'$


This formula is similar to the error bound formula for a mean, except that
the "appropriate standard deviation" is different. For a mean, when the
population standard deviation is known, the appropriate standard deviation
that we use is $\frac{\sigma}{\sqrt{n}}$. For a proportion, the appropriate standard deviation is
$\sqrt{\frac{pq}{n}}$.


However, in the error bound formula, we use $\sqrt{\frac{p'q'}{n}}$ as the standard
deviation, instead of $\sqrt{\frac{pq}{n}}$.


In the error bound formula, the sample proportions p′ and q′ are
estimates of the unknown population proportions p and q. The estimated
proportions p′ and q′ are used because p and q are not known. p′ and q′ are
calculated from the data. p′ is the estimated proportion of successes. q′ is
the estimated proportion of failures.


The confidence interval can only be used if the number of successes np′
and the number of failures nq′ are both larger than 5.


Note: For the normal distribution of proportions, the z-score formula is as
follows.


If $P' \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$ then the z-score formula is $z = \frac{p' - p}{\sqrt{\frac{pq}{n}}}$


Example: 
Exercise: 


Problem: 


Suppose that a market research firm is hired to estimate the percent of 
adults living in a large city who have cell phones. 500 randomly 
selected adult residents in this city are surveyed to determine whether 
they have cell phones. Of the 500 people surveyed, 421 responded yes 
- they own cell phones. Using a 95% confidence level, compute a 
confidence interval estimate for the true proportion of adult residents
of this city who have cell phones. 

Solution 


• You can use technology to directly calculate the confidence
interval.

• The first solution is step-by-step (Solution A).

• The second solution uses a function of the TI-83, 83+ or 84
calculators (Solution B).


Solution A


Let X = the number of people in the sample who have cell phones. X
is binomial. $X \sim B\left(500, \frac{421}{500}\right)$.


To calculate the confidence interval, you must find p′, q′, and EBP.
n = 500    x = the number of successes = 421


$p' = \frac{x}{n} = \frac{421}{500} = 0.842$


p′ = 0.842 is the sample proportion; this is the point estimate of the
population proportion.


q′ = 1 − p′ = 1 − 0.842 = 0.158


Since CL = 0.95, then
$\alpha = 1 - CL = 1 - 0.95 = 0.05 \qquad \frac{\alpha}{2} = 0.025$.


Then $z_{\alpha/2} = z_{0.025} = 1.96$


Use the TI-83, 83+ or 84+ calculator command invNorm(0.975,0,1) to
find $z_{0.025}$. Remember that the area to the right of $z_{0.025}$ is 0.025 and the
area to the left of $z_{0.025}$ is 0.975. This can also be found using
appropriate commands on other calculators, using a computer, or
using a Standard Normal probability table.


$EBP = z_{\alpha/2} \cdot \sqrt{\frac{p'q'}{n}} = 1.96 \cdot \sqrt{\frac{(0.842)(0.158)}{500}} = 0.032$


p′ − EBP = 0.842 − 0.032 = 0.810
p′ + EBP = 0.842 + 0.032 = 0.874


The confidence interval for the true binomial population proportion is
(p′ − EBP, p′ + EBP) = (0.810, 0.874).


Interpretation 
We estimate with 95% confidence that between 81% and 87.4% of all 
adult residents of this city have cell phones. 


Explanation of 95% Confidence Level 

95% of the confidence intervals constructed in this way would contain 
the true value for the population proportion of all adult residents of 
this city who have cell phones. 


Solution B


Using a function of the TI-83, 83+ or 84 calculators: 


Press STAT and arrow over to TESTS. 

Arrow down to A:1-PropZInt. Press ENTER.
Arrow down to x and enter 421. 

Arrow down to n and enter 500. 

Arrow down to C-Level and enter .95. 

Arrow down to Calculate and press ENTER. 
The confidence interval is (0.81003, 0.87397). 
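The same interval can be reproduced off the calculator. The sketch below is one possible Python version, not the textbook's method: the function name proportion_interval is hypothetical, and NormalDist().inv_cdf from the standard library stands in for the calculator's invNorm command.

```python
import math
from statistics import NormalDist


def proportion_interval(x, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_prime = x / n
    q_prime = 1 - p_prime
    # The text requires more than 5 successes and more than 5 failures.
    if n * p_prime <= 5 or n * q_prime <= 5:
        raise ValueError("too few successes or failures for the normal approximation")
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # area to the left is 1 - alpha/2
    ebp = z * math.sqrt(p_prime * q_prime / n)
    return p_prime - ebp, p_prime + ebp, ebp


lower, upper, ebp = proportion_interval(x=421, n=500, confidence=0.95)
print(f"EBP = {ebp:.3f}, interval = ({lower:.5f}, {upper:.5f})")
```

Because the z-score is not rounded to 1.96 here, the endpoints should match the calculator result to about five decimal places rather than the three-decimal hand calculation.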


Example: 
Exercise: 


Problem: 


For a class project, a political science student at a large university 
wants to estimate the percent of students that are registered voters. He 
surveys 500 students and finds that 300 are registered voters. 
Compute a 90% confidence interval for the true percent of students 
that are registered voters and interpret the confidence interval. 


Solution: 


• You can use technology to directly calculate the confidence
interval.

• The first solution is step-by-step (Solution A).

• The second solution uses a function of the TI-83, 83+ or 84
calculators (Solution B).


Solution A 
x = 300 and n = 500.


$p' = \frac{x}{n} = \frac{300}{500} = 0.600$


q′ = 1 − p′ = 1 − 0.600 = 0.400


Since CL = 0.90, then
$\alpha = 1 - CL = 1 - 0.90 = 0.10 \qquad \frac{\alpha}{2} = 0.05$.


$z_{\alpha/2} = z_{0.05} = 1.645$


Use the TI-83, 83+ or 84+ calculator command invNorm(0.95,0,1) to
find $z_{0.05}$. Remember that the area to the right of $z_{0.05}$ is 0.05 and the
area to the left of $z_{0.05}$ is 0.95. This can also be found using
appropriate commands on other calculators, using a computer, or
using a Standard Normal probability table.


$EBP = z_{\alpha/2} \cdot \sqrt{\frac{p'q'}{n}} = 1.645 \cdot \sqrt{\frac{(0.60)(0.40)}{500}} = 0.036$


p′ − EBP = 0.60 − 0.036 = 0.564


p′ + EBP = 0.60 + 0.036 = 0.636


The confidence interval for the true binomial population proportion is
(p′ − EBP, p′ + EBP) = (0.564, 0.636).
Interpretation: 


e We estimate with 90% confidence that the true percent of all 
students that are registered voters is between 56.4% and 63.6%. 

e Alternate Wording: We estimate with 90% confidence that 
between 56.4% and 63.6% of ALL students are registered voters. 


Explanation of 90% Confidence Level 

90% of all confidence intervals constructed in this way contain the 
true value for the population percent of students that are registered 
voters. 


Solution B 
Using a function of the TI-83, 83+ or 84 calculators: 


Press STAT and arrow over to TESTS. 

Arrow down to A:1-PropZInt. Press ENTER.
Arrow down to x and enter 300. 

Arrow down to n and enter 500. 

Arrow down to C-Level and enter .90. 

Arrow down to Calculate and press ENTER. 
The confidence interval is (0.564, 0.636). 


Calculating the Sample Size n 


If researchers desire a specific margin of error, then they can use the error 
bound formula to calculate the required sample size. 


The error bound formula for a population proportion is 


• $EBP = z_{\alpha/2} \cdot \sqrt{\frac{p'q'}{n}}$
• Solving for n gives you an equation for the sample size.


Example: 

Suppose a mobile phone company wants to determine the current 
percentage of customers aged 50+ that use text messaging on their cell 
phone. How many customers aged 50+ should the company survey in order 
to be 90% confident that the estimated (sample) proportion is within 3 
percentage points of the true population proportion of customers aged 50+ 
that use text messaging on their cell phone?


Solution 
From the problem, we know that EBP = 0.03 (3% = 0.03) and


$z_{\alpha/2} = z_{0.05} = 1.645$ because the confidence level is 90%.


However, in order to find n , we need to know the estimated (sample) 
proportion p'. Remember that q'=1-p'. But, we do not know p' yet. Since we 
multiply p' and q' together, we make them both equal to 0.5 because p'q'= 
(.5)(.5)=.25 results in the largest possible product. (Try other products: (.6) 
(.4)=.24; (.3)(.7)=.21; (.2)(.8)=.16 and so on). The largest possible product 
gives us the largest n. This gives us a large enough sample so that we can 
be 90% confident that we are within 3 percentage points of the true 
population proportion. To calculate the sample size n, use the formula and 
make the substitutions. 


$n = \frac{z_{\alpha/2}^2 \cdot p'q'}{EBP^2}$ gives $n = \frac{1.645^2 (0.5)(0.5)}{0.03^2} = 751.7$


Round the answer to the next higher value. The sample size should be 752
cell phone customers aged 50+ in order to be 90% confident that the
estimated (sample) proportion is within 3 percentage points of the true
population proportion of all customers aged 50+ that use text messaging on
their cell phone.
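The claim that p′ = 0.5 gives the largest required sample can be checked numerically. The short Python sketch below is an illustration only; the variable names are invented here, and z = 1.645 and EBP = 0.03 are the values from the example.

```python
import math

z = 1.645   # table z-score for 90% confidence, as used in the example
ebp = 0.03  # desired error bound (3 percentage points)

# Required sample size for each candidate value of p' from 0.1 to 0.9.
sizes = {}
for tenths in range(1, 10):
    p_prime = tenths / 10
    q_prime = 1 - p_prime
    n_exact = z ** 2 * p_prime * q_prime / ebp ** 2
    sizes[p_prime] = math.ceil(n_exact)  # round up to the next whole number

worst_case_p = max(sizes, key=sizes.get)
for p_prime, n in sizes.items():
    print(f"p' = {p_prime:.1f} -> n = {n}")
print("largest n occurs at p' =", worst_case_p)
```

The largest entry is at p′ = 0.5, where n = 752, so planning for 0.5 covers every other possible value of the true proportion.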

**With contributions from Roberta Bloom. 


Glossary 


Binomial Distribution 
A discrete random variable (RV) which arises from Bernoulli trials. 
There are a fixed number, n, of independent trials. “Independent” 
means that the result of any trial (for example, trial 1) does not affect 
the results of the following trials, and all trials are conducted under the 
same conditions. Under these circumstances the binomial RV X is
defined as the number of successes in n trials. The notation is:
$X \sim B(n, p)$. The mean is $\mu = np$ and the standard deviation is $\sigma = \sqrt{npq}$.
The probability of exactly x successes in n trials is


$P(X = x) = \binom{n}{x} p^x q^{n-x}$.


Confidence Interval (CI) 
An interval estimate for an unknown population parameter. This 
depends on: 


• The desired confidence level.

• Information that is known about the distribution (for example,
known standard deviation).

• The sample and its size.


Confidence Level (CL) 
The percent expression for the probability that the confidence interval 
contains the true population parameter. For example, if the CL = 90%,
then in 90 out of 100 samples the interval estimate will enclose the
true population parameter. 


Error Bound for a Population Proportion (EBP)
The margin of error. Depends on the confidence level, sample size, and 
the estimated (from the sample) proportion of successes. 


Normal Distribution 


A continuous random variable (RV) with pdf
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2 / (2\sigma^2)}$, where $\mu$ is the mean of the distribution and
$\sigma$ is the standard deviation. Notation: $X \sim N(\mu, \sigma)$. If $\mu = 0$ and
$\sigma = 1$, the RV is called the standard normal distribution.


Sample Size 
Calculations for determining the required sample size when calculating a confidence
interval for the population mean or population proportion.


Determining Sample Size Required to Estimate μ


Prior to creating a confidence interval a sample must be taken. Often the number of data 
values needed in a sample to obtain a particular level of confidence within a given error 
needs to be determined prior to taking the sample. If the sample is too small the result 
may not be useful and if the sample is too big both time and money are wasted in the 
sampling. 


Let’s begin by looking at the equation for the Error Bound. 
$EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$
To affect the size of the error, what can be changed in the formula?


Only $z_{\alpha/2}$, by changing the level of confidence, or n, by changing the sample size, affects
the error. The standard deviation is a given which one cannot change. (Note: if n > 30,
then the sample standard deviation can be used to approximate the population standard
deviation.)


Example: 

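Given the following data: n = 64, $\bar{x} = 36$, and $\sigma = 3$, the EBM values for 80%, 90%, 95%, and
99% confidence intervals are:

80%: $EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} = 1.28 \cdot \frac{3}{\sqrt{64}} = 0.48$
90%: $EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} = 1.645 \cdot \frac{3}{\sqrt{64}} = 0.616875$
95%: $EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} = 1.96 \cdot \frac{3}{\sqrt{64}} = 0.735$
99%: $EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} = 2.58 \cdot \frac{3}{\sqrt{64}} = 0.9675$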


Note that as the confidence increases, so also does the EBM. To ensure that the error 
bound is small, the confidence must be decreased. Hence changing the confidence to 
lower the error is not a practical solution. 


Example: 


What happens as the sample size is increased? Calculate the EBM for a 90% confidence 
interval for n = 25, 64, and 100. 
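n = 25: $EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} = 1.645 \cdot \frac{3}{\sqrt{25}} = 0.987$
n = 64: $EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} = 1.645 \cdot \frac{3}{\sqrt{64}} = 0.616875$
n = 100: $EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} = 1.645 \cdot \frac{3}{\sqrt{100}} = 0.4935$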


As the sample size increases, the EBM decreases. The question now becomes how large 
a sample is needed for a particular error? 
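Both effects from the two examples above, the confidence level and the sample size, can be tabulated with a few lines of Python. This is a sketch for illustration: error_bound_mean is a name chosen here, and the z-values are the rounded table values quoted in the examples.

```python
import math


def error_bound_mean(z, sigma, n):
    """EBM = z * sigma / sqrt(n), for a mean with known population standard deviation."""
    return z * sigma / math.sqrt(n)


sigma = 3  # population standard deviation from the examples

# Rounded table z-scores for each confidence level, as quoted in the text.
z_table = {0.80: 1.28, 0.90: 1.645, 0.95: 1.96, 0.99: 2.58}

# Fixed n = 64, varying confidence level: the EBM grows with the confidence.
for cl, z in z_table.items():
    print(f"CL = {cl:.0%}, n = 64: EBM = {error_bound_mean(z, sigma, 64):.6f}")

# Fixed 90% confidence, varying n: the EBM shrinks as n grows.
for n in (25, 64, 100):
    print(f"CL = 90%, n = {n}: EBM = {error_bound_mean(1.645, sigma, n):.6f}")
```

Since the EBM is proportional to 1/√n, quadrupling the sample size only halves the error bound.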


Begin by solving the equation for the EBM in terms of n. 
Equation: 


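$EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \;\Rightarrow\; n = \left(\frac{z_{\alpha/2} \cdot \sigma}{EBM}\right)^2$


• where $z_{\alpha/2}$ = the critical z-score based on the desired confidence level
• EBM = desired margin of error
• σ = population standard deviation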


Often the population standard deviation is unknown; hence the sample standard 
deviation from a previous sample of size greater than 30 may be used as an 
approximation to o. 


The value found by using the formula for sample size is generally not a whole 
number. However the sample size must be a whole number, so ALWAYS ROUND 
UP to the next larger whole number. 


Example: 

Suppose for the given information, we want to be 90% confident with an error of only
±0.5. How large should n be?

Now substitute the given values:

Equation:


$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{EBM}\right)^2 = \left(\frac{1.645(3)}{0.5}\right)^2 = 9.87^2 = 97.4169$


Since you cannot sample a fraction of an item, round up: n = 98.
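The sample-size formula and the round-up rule translate directly into code. The following Python sketch assumes a hypothetical helper named sample_size_for_mean; the inputs are the values from the example above.

```python
import math


def sample_size_for_mean(z, sigma, ebm):
    """Smallest whole n for which z * sigma / sqrt(n) does not exceed ebm."""
    n_exact = (z * sigma / ebm) ** 2
    return math.ceil(n_exact)  # ALWAYS round up to the next whole number


# 90% confidence (z = 1.645), sigma = 3, desired error bound 0.5.
print(sample_size_for_mean(z=1.645, sigma=3, ebm=0.5))
```

Using math.ceil rather than round matters: rounding 97.4169 to the nearest whole number would give 97, which leaves the error bound slightly above 0.5.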


Example: 
How large a sample would you need if you wanted to be 95% confident with an error of 
±0.25?


Equation:
$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{EBM}\right)^2 = \left(\frac{1.96(3)}{0.25}\right)^2 = 23.52^2 \approx 553.19 \Rightarrow n = 554$
Exercise: 
Problem: 


Suppose the scores on a statistics final are normally distributed with a standard
deviation of 10 points. You have been asked to construct a 95% confidence interval
with an error of no more than 2 points. How large a sample must be taken?


Solution: 
$z_{0.025} = 1.96$
EBM = 2
σ = 10


$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{EBM}\right)^2 = \left(\frac{1.96(10)}{2}\right)^2 = (9.8)^2 = 96.04$, rounded up ⇒ n = 97.


Hence, a sample of size 97 must be taken to create a 95% confidence interval with
an error of no more than two points.


Exercise: 


Problem: 


Suppose you want to be 98% confident with an error of no more than 1.5 points, 
how large must your sample be? 


Solution: 
$z_{0.01} = 2.33$


EBM = 1.5


σ = 10


Equation:
$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{EBM}\right)^2 = \left(\frac{2.33(10)}{1.5}\right)^2 = (15.533)^2 = 241.28$, rounded up ⇒ n = 242.


Hence, a sample of size 242 must be taken to create a 98% confidence interval with 
an error of no more than 1.5 points. 


Determining Sample Size Required to Estimate p. 


Again, let’s look at the equation for the Error Bound.


$EBP = z_{\alpha/2} \cdot \sqrt{\frac{p'q'}{n}}$

• where $p' = \frac{x}{n}$ is the point estimate for the true population proportion
• x = the number of successes
• n = the sample size
• q′ = 1 − p′


Solving for n, we obtain: $n = \left(\frac{z_{\alpha/2}}{EBP}\right)^2 p'q'$


Example: 

Suppose that a previous study claimed that only 25% of people recycle on a regular 
basis. Determine the sample size needed to create a 95% confidence interval for the true 
population proportion with an error of only +/- 3%. 

p′ = 0.25 ⇒ q′ = 1 − p′ = 0.75

$n = \left(\frac{z_{\alpha/2}}{EBP}\right)^2 p'q' = \left(\frac{1.96}{0.03}\right)^2 (0.25)(0.75) = 800.33$.

ALWAYS ROUND UP, hence n = 801.


If there is no previous sample, let p’ = 0.5 and q’ = 0.5 as this gives the largest value 
for n. 


Example: 


What would n be for a 95% confidence interval with an error of only +/-3% with no 
sample data? 
Equation: 


$n = \left(\frac{z_{\alpha/2}}{EBP}\right)^2 p'q' = \left(\frac{1.96}{0.03}\right)^2 (0.5)(0.5) = 1067.11$


ALWAYS ROUND UP, hence n = 1068.
Note that not having a previous sample greatly increases the number of data values 
needed in a sample. Often a pilot study is done to generate an approximation for p. 
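The difference a prior estimate makes can be seen by running the formula both ways. The Python sketch below is illustrative only; proportion_sample_size is a name introduced here, and defaulting p_estimate to 0.5 mirrors the rule for the case with no previous sample.

```python
import math


def proportion_sample_size(z, ebp, p_estimate=0.5):
    """n = (z / EBP)^2 * p' * q', rounded up; p' defaults to 0.5 when no prior estimate exists."""
    q_estimate = 1 - p_estimate
    n_exact = (z / ebp) ** 2 * p_estimate * q_estimate
    return math.ceil(n_exact)


# 95% confidence (z = 1.96) with an error of 3 percentage points.
with_prior = proportion_sample_size(z=1.96, ebp=0.03, p_estimate=0.25)
no_prior = proportion_sample_size(z=1.96, ebp=0.03)
print(with_prior, no_prior)
```

This prints 801 and 1068, matching the two examples: the prior estimate of 25% saves 267 observations at the same confidence level and error bound.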


Exercise: 


Problem: 


The Mesa College mathematics department has noticed that a number of students 
place in a non-transfer level course and only need a 6 week refresher rather than an 
entire semester long course. If it is thought that about 10% of the students fall in 
this category, how many must the department survey if they wish to be 95% certain
that the true population proportion is within +/-5%? 


Solution: 
Equation: 


$n = \left(\frac{z_{\alpha/2}}{EBP}\right)^2 p'q' = \left(\frac{1.96}{0.05}\right)^2 (0.10)(0.90) \approx 138.30$


ALWAYS ROUND UP, hence n = 139.
Exercise: 


Problem: 


Suppose the math department has no previous information. How many students 
should be surveyed? 


Solution: 
Equation: 


$n = \left(\frac{z_{\alpha/2}}{EBP}\right)^2 p'q' = \left(\frac{1.96}{0.05}\right)^2 (0.5)(0.5) = 384.16$


ALWAYS ROUND UP, hence n = 385.
Exercise: 
Problem: 
Suppose Cardmart wishes to know what proportion of men buy their wife a Mother’s
Day card. How many people must be sampled if it wishes to be 95% certain that
the proportion is within 2%? Suppose that a previous sample of 500 men reported
that 421 of them bought their wife a Mother’s Day card.


Solution: 
Equation: 
$p' = \frac{421}{500} = 0.842$, hence
Equation:

$q' = 1 - p' = 1 - 0.842 = 0.158$

EBP = 0.02
Equation:


$n = \left(\frac{z_{\alpha/2}}{EBP}\right)^2 p'q' = \left(\frac{1.96}{0.02}\right)^2 (0.842)(0.158) \approx 1277.68$


ALWAYS ROUND UP, n = 1278. Hence 1278 people need to be surveyed to ensure a
95% confidence interval with an error of at most 2%.


Exercise: 


Problem: 


Suppose there is no previous sample. How many men will need to be surveyed? 


Solution: 
Equation:


$n = \left(\frac{z_{\alpha/2}}{EBP}\right)^2 p'q' = \left(\frac{1.96}{0.02}\right)^2 (0.5)(0.5) = 2401$


The result is already a whole number, so n = 2401. Hence 2401 people need to be surveyed to ensure
a 95% confidence interval with an error of at most 2%.


Summary of Formulas 

Formula: General form of a confidence interval

(lower value, upper value) = (point estimate − error bound, point estimate + error bound)

Formula: To find the error bound when you know the confidence interval


error bound = upper value − point estimate OR


$\text{error bound} = \frac{\text{upper value} - \text{lower value}}{2}$


Formula: Single Population Mean, Known Standard Deviation, Normal Distribution


Use the Normal Distribution for Means: $EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$


The confidence interval has the format $(\bar{x} - EBM,\ \bar{x} + EBM)$.

Formula: Single Population Mean, Unknown Standard Deviation, Student's-t Distribution


Use the Student's-t Distribution with degrees of freedom df = n − 1: $EBM = t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$


Formula: Single Population Proportion, Normal Distribution


Use the Normal Distribution for a single population proportion $p' = \frac{x}{n}$


$EBP = z_{\alpha/2} \cdot \sqrt{\frac{p'q'}{n}}$, where $p' + q' = 1$

The confidence interval has the format (p′ − EBP, p′ + EBP).

Formula: Point Estimates

$\bar{x}$ is a point estimate for $\mu$

p′ is a point estimate for p


s is a point estimate for $\sigma$


Practice 1: Confidence Intervals for Averages, Known Population Standard 
Deviation 


Student Learning Outcomes 


• The student will calculate confidence intervals for means when the
population standard deviation is known.


Given 


The mean age for all Foothill College students for a recent Fall term was 
33.2. The population standard deviation has been pretty consistent at 15. 
Suppose that twenty-five Winter students were randomly selected. The 
mean age for the sample was 30.4. We are interested in the true mean age 
for Winter Foothill College students. 

(http://research.fhda.edu/factbook/FH_Demo_Trends/FoothillDemographicTrends.htm)


Let X = the age of a Winter Foothill College student 


Calculating the Confidence Interval 


Exercise: 


Problem: $\bar{x}$ =


Solution: 


30.4 


Exercise: 


Problem: n= 


Solution: 


25 


Exercise: 


Problem: 15=(insert symbol here) 


Solution: 


σ


Exercise: 


Problem: Define the Random Variable, $\bar{X}$, in words.
$\bar{X}$ =
Solution: 


the mean age of 25 randomly selected Winter Foothill students 


Exercise: 


Problem: What is $\bar{x}$ estimating?
Solution:


μ
Exercise: 


Problem: Is $\sigma_x$ known?
Solution: 


yes 


Exercise: 


Problem: 


As a result of your answer to (4), state the exact distribution to use
when calculating the Confidence Interval. 


Solution: 


Normal 


Explaining the Confidence Interval 


Construct a 95% Confidence Interval for the true mean age of Winter 
Foothill College students. 
Exercise: 


Problem: How much area is in both tails (combined)? α = __


Solution: 


0.05 


Exercise: 


Problem: How much area is in each tail? $\frac{\alpha}{2}$ = __


Solution: 


0.025 


Exercise: 
Problem: Identify the following specifications:

• a. lower limit =
• b. upper limit =
• c. error bound =


Solution:


• a. 24.52
• b. 36.28
• c. 5.88


Exercise: 


Problem: The 95% Confidence Interval is: 
Solution: 


(24.52,36.28) 
Exercise: 
Problem: 


Fill in the blanks on the graph with the areas, upper and lower 
limits of the Confidence Interval, and the sample mean. 


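[Graph: blank normal curve with spaces to fill in CL, the area α/2 in each tail, the lower and upper limits, and the sample mean.]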


Exercise: 
Problem: In one complete sentence, explain what the interval means. 


Discussion Questions 


Exercise: 


Problem: 


Using the same mean, standard deviation and level of confidence, 
suppose that n were 69 instead of 25. Would the error bound become 
larger or smaller? How do you know? 


Exercise: 
Problem: 
Using the same mean, standard deviation and sample size, how would 


the error bound change if the confidence level were reduced to 90%? 
Why? 


Practice 2: Confidence Intervals for Averages, Unknown Population 
Standard Deviation 


Student Learning Outcomes 


• The student will calculate confidence intervals for means when the
population standard deviation is unknown.


Given 


The following real data are the result of a random survey of 39 national 
flags (with replacement between picks) from various countries. We are 
interested in finding a confidence interval for the true mean number of 
colors on a national flag. Let X = the number of colors on a national flag. 


X    Freq.
1    1
2    7
3    18
4    7
5    6

Calculating the Confidence Interval 


Exercise: 


Problem: Calculate the following:


• a. $\bar{x}$ =
• b. $s_x$ =
• c. n =


Solution:


• a. 3.26
• b. 1.02
• c. 39


Exercise: 


Problem: 


Define the Random Variable, $\bar{X}$, in words. $\bar{X}$ =


Solution: 


the mean number of colors of 39 flags 


Exercise: 


Problem: What is $\bar{x}$ estimating?
Solution:


μ
Exercise: 


Problem: Is $\sigma_x$ known?


Solution: 


No 
Exercise: 


Problem: 


As a result of your answer to (4), state the exact distribution to use
when calculating the Confidence Interval. 


Solution: 


$t_{38}$


Confidence Interval for the True Mean Number 


Construct a 95% Confidence Interval for the true mean number of colors on 
national flags. 
Exercise: 


Problem: How much area is in both tails (combined)? α = __


Solution: 
0.05 


Exercise: 


Problem: How much area is in each tail? $\frac{\alpha}{2}$ = __


Solution: 
0.025 
Exercise: 
Problem: Calculate the following:


• a. lower limit =
• b. upper limit =
• c. error bound =


Solution:


• a. 2.93
• b. 3.59
• c. 0.33


Exercise: 


Problem: The 95% Confidence Interval is: 
Solution: 


(2.93, 3.59)
Exercise: 


Problem: 


Fill in the blanks on the graph with the areas, upper and lower limits of 
the Confidence Interval and the sample mean. 


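[Graph: blank Student's-t curve with spaces to fill in CL, the area α/2 in each tail, the lower and upper limits, and the sample mean.]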


Exercise: 
Problem: In one complete sentence, explain what the interval means. 


Discussion Questions 


Exercise: 
Problem: 
Using the same $\bar{x}$, $s_x$, and level of confidence, suppose that n were 69
instead of 39. Would the error bound become larger or smaller? How
do you know?


Exercise: 
Problem: 


Using the same $\bar{x}$, $s_x$, and n = 39, how would the error bound change
if the confidence level were reduced to 90%? Why? 


Practice 3: Confidence Intervals for Proportions 


Student Learning Outcomes 


• The student will calculate confidence intervals for proportions.


Given 


The Ice Chalet offers dozens of different beginning ice-skating classes. All 
of the class names are put into a bucket. The 5 P.M., Monday night, ages 8 - 
12, beginning ice-skating class was picked. In that class were 64 girls and 
16 boys. Suppose that we are interested in the true proportion of girls, ages 
8 - 12, in all beginning ice-skating classes at the Ice Chalet. Assume that the 
children in the selected class are a random sample of the population.


Estimated Distribution 
Exercise: 
Problem: What is being counted? 
Exercise: 
Problem: In words, define the Random Variable X. X =


Solution: 
The number of girls, age 8-12, in the beginning ice skating class 
Exercise: 
Problem: Calculate the following:

• a. x =
• b. n =
• c. p′ =


Solution:


• a. 64
• b. 80
• c. 0.8


Exercise: 


Problem: State the estimated distribution of X. X ~


Solution: 
B(80, 0.80) 
Exercise: 
Problem: Define a new Random Variable P′. What is p′ estimating?
Solution:


p
Exercise: 


Problem: In words, define the Random Variable P′. P′ =
Solution: 


The proportion of girls, age 8-12, in the beginning ice skating class. 


Exercise: 


Problem: State the estimated distribution of P′. P′ ~


Explaining the Confidence Interval 


Construct a 92% Confidence Interval for the true proportion of girls in the 
age 8 - 12 beginning ice-skating classes at the Ice Chalet. 
Exercise: 


Problem: How much area is in both tails (combined)? α = __


Solution: 
1 - 0.92 = 0.08 
Exercise: 


Problem: How much area is in each tail? $\frac{\alpha}{2}$ = __


Solution: 
0.04 
Exercise: 

Problem: Calculate the following:

• a. lower limit =
• b. upper limit =
• c. error bound =

Solution:


• a. 0.72
• b. 0.88
• c. 0.08


Exercise: 


Problem: The 92% Confidence Interval is: 


Solution: 


(0.72; 0.88) 
Exercise: 


Problem: 
Fill in the blanks on the graph with the areas, upper and lower
limits of the Confidence Interval, and the sample proportion.

[Graph: blank normal curve with spaces to fill in CL, the area α/2 in each tail, the lower and upper limits, and p′.]


Exercise: 


Problem: In one complete sentence, explain what the interval means. 


Discussion Questions 


Exercise: 
Problem: 
Using the same p′ and level of confidence, suppose that n were
increased to 100. Would the error bound become larger or smaller?
How do you know?


Exercise: 
Problem: 
Using the same p′ and n = 80, how would the error bound change if
the confidence level were increased to 98%? Why? 


Exercise: 


Problem: 


If you decreased the allowable error bound, why would the minimum 
sample size increase (keeping the same level of confidence)? 


Homework 


Note:If you are using a student's-t distribution for a homework problem 
below, you may assume that the underlying population is normally 
distributed. (In general, you must first prove that assumption, though.) 


Exercise: 


Problem: 


Among various ethnic groups, the standard deviation of heights is 
known to be approximately 3 inches. We wish to construct a 95% 
confidence interval for the mean height of male Swedes. 48 male 
Swedes are surveyed. The sample mean is 71 inches. The sample 
standard deviation is 2.8 inches. 


• a.
  ◦ i. $\bar{x}$ =
  ◦ ii. σ =
  ◦ iii. $s_x$ =
  ◦ iv. n =
  ◦ v. n − 1 =

• b. Define the Random Variables X and $\bar{X}$, in words.

• c. Which distribution should you use for this problem? Explain
your choice.

• d. Construct a 95% confidence interval for the population mean
height of male Swedes.
  ◦ i. State the confidence interval.
  ◦ ii. Sketch the graph.
  ◦ iii. Calculate the error bound.

• e. What will happen to the level of confidence obtained if 1000
male Swedes are surveyed instead of 48? Why?


Solution: 


• a.
  ◦ i. 71
  ◦ ii. 3
  ◦ iii. 2.8
  ◦ iv. 48
  ◦ v. 47

• c. $N\left(71, \frac{3}{\sqrt{48}}\right)$

• d.
  ◦ i. CI: (70.15, 71.85)
  ◦ iii. EB = 0.85


Exercise: 


Problem: 


In six packages of “The Flintstones® Real Fruit Snacks” there were 5 
Bam-Bam snack pieces. The total number of snack pieces in the six 
bags was 68. We wish to calculate a 96% confidence interval for the 
population proportion of Bam-Bam snack pieces. 


• a. Define the Random Variables X and P′, in words.

• b. Which distribution should you use for this problem? Explain
your choice.

• c. Calculate p′.

• d. Construct a 96% confidence interval for the population
proportion of Bam-Bam snack pieces per bag.
  ◦ i. State the confidence interval.
  ◦ ii. Sketch the graph.
  ◦ iii. Calculate the error bound.

• e. Do you think that six packages of fruit snacks yield enough data
to give accurate results? Why or why not?


Exercise: 


Problem: 


A random survey of enrollment at 35 community colleges across the 
United States yielded the following figures (source: Microsoft 
Bookshelf): 6414; 1550; 2109; 9350; 21828; 4300; 5944; 5722; 2825; 
2044; 5481; 5200; 5853; 2750; 10012; 6357; 27000; 9414; 7681; 
3200; 17500; 9200; 7380; 18314; 6557; 13713; 17768; 7493; 2771; 
2861; 1263; 7285; 28165; 5080; 11622. Assume the underlying 
population is normal. 


a.
    i. $\bar{x}$ =
    ii. $s_x$ =
    iii. $n$ =
    iv. $n - 1$ =

b. Define the Random Variables $X$ and $\bar{X}$, in words.

c. Which distribution should you use for this problem? Explain
your choice.

d. Construct a 95% confidence interval for the population mean
enrollment at community colleges in the United States.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

e. What will happen to the error bound and confidence interval if
500 community colleges were surveyed? Why?


Solution: 

a.
    i. 8629
    ii. 6944
    iii. 35
    iv. 34

c. $t_{34}$

d.
    i. CI: (6244, 11,014)
    iii. EB = 2385

e. It will become smaller.
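
The numbers in this solution can be reproduced from the summary statistics. This is a minimal sketch, assuming the formula EB = t * s / sqrt(n); the critical value 2.032 is an assumption read from a t table (34 degrees of freedom, 0.025 in each tail), and the name `t_interval` is ours for illustration.

```python
from math import sqrt


def t_interval(mean, s, n, t_crit):
    """Confidence interval for a mean when sigma is unknown.

    t_crit must be looked up for n - 1 degrees of freedom;
    it is not computed here.
    """
    error_bound = t_crit * s / sqrt(n)
    return mean - error_bound, mean + error_bound, error_bound


# Assumption: table value for 95% confidence with 34 degrees of freedom.
T_CRIT_95_DF34 = 2.032

low, high, eb = t_interval(mean=8629, s=6944, n=35, t_crit=T_CRIT_95_DF34)
print(f"CI: ({low:.0f}, {high:.0f}), EB = {eb:.0f}")
```

Because sqrt(n) is in the denominator, a sample of 500 colleges with a similar standard deviation would give a much smaller error bound, which agrees with the answer to part (e).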


Exercise: 


Problem: 


From a stack of IEEE Spectrum magazines, announcements for 84 
upcoming engineering conferences were randomly picked. The mean 
length of the conferences was 3.94 days, with a standard deviation of 
1.28 days. Assume the underlying population is normal. 


a. Define the Random Variables $X$ and $\bar{X}$, in words.

b. Which distribution should you use for this problem? Explain
your choice.

c. Construct a 95% confidence interval for the population mean
length of engineering conferences.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.


Exercise: 


Problem: 


Suppose that a committee is studying whether or not there is waste of 
time in our judicial system. It is interested in the mean amount of time 
individuals waste at the courthouse waiting to be called for service. 
The committee randomly surveyed 81 people. The sample mean was 8 
hours with a sample standard deviation of 4 hours. 


a.
    i. $\bar{x}$ =
    ii. $s_x$ =
    iii. $n$ =
    iv. $n - 1$ =

b. Define the Random Variables $X$ and $\bar{X}$, in words.

c. Which distribution should you use for this problem? Explain
your choice.

d. Construct a 95% confidence interval for the population mean
time wasted.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

e. Explain in a complete sentence what the confidence interval
means.


Solution: 

a.
    i. 8
    ii. 4
    iii. 81
    iv. 80

c. $t_{80}$

d.
    i. CI: (7.12, 8.88)
    iii. EB = 0.88

Exercise: 
Problem: 


Suppose that an accounting firm does a study to determine the time 
needed to complete one person’s tax forms. It randomly surveys 100 
people. The sample mean is 23.6 hours. There is a known standard 
deviation of 7.0 hours. The population distribution is assumed to be 
normal. 


a.
    i. $\bar{x}$ =
    ii. $\sigma$ =
    iii. $s_x$ =
    iv. $n$ =
    v. $n - 1$ =

b. Define the Random Variables $X$ and $\bar{X}$, in words.

c. Which distribution should you use for this problem? Explain
your choice.

d. Construct a 90% confidence interval for the population mean
time to complete the tax forms.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

e. If the firm wished to increase its level of confidence and keep
the error bound the same by taking another survey, what changes
should it make?

f. If the firm did another survey, kept the error bound the same, and
only surveyed 49 people, what would happen to the level of
confidence? Why?

g. Suppose that the firm decided that it needed to be at least 96%
confident of the population mean length of time to within 1 hour.
How would the number of people the firm surveys change? Why?


Exercise: 


Problem: 


A sample of 16 small bags of the same brand of candies was selected. 
Assume that the population distribution of bag weights is normal. The 
weight of each bag was then recorded. The mean weight was 2 ounces 
with a standard deviation of 0.12 ounces. The population standard 
deviation is known to be 0.1 ounce. 


a.
    i. $\bar{x}$ =
    ii. $\sigma$ =
    iii. $s_x$ =
    iv. $n$ =
    v. $n - 1$ =

b. Define the Random Variable $X$, in words.

c. Define the Random Variable $\bar{X}$, in words.

d. Which distribution should you use for this problem? Explain
your choice.

e. Construct a 90% confidence interval for the population mean
weight of the candies.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

f. Construct a 98% confidence interval for the population mean
weight of the candies.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

g. In complete sentences, explain why the confidence interval in (f)
is larger than the confidence interval in (e).

h. In complete sentences, give an interpretation of what the interval
in (f) means.


Solution: 
a.
    i. 2
    ii. 0.1
    iii. 0.12
    iv. 16
    v. 15

b. the weight of 1 small bag of candies
c. the mean weight of 16 small bags of candies
d. $N\left(2, \frac{0.1}{\sqrt{16}}\right)$

e.
    i. CI: (1.96, 2.04)
    iii. EB = 0.04

f.
    i. CI: (1.94, 2.06)
    iii. EB = 0.06


Exercise: 


Problem: 


A pharmaceutical company makes tranquilizers. It is assumed that the 
distribution for the length of time they last is approximately normal. 
Researchers in a hospital used the drug on a random sample of 9 
patients. The effective period of the tranquilizer for each patient (in 
hours) was as follows: 2.7; 2.8; 3.0; 2.3; 2.3; 2.2; 2.8; 2.1; and 2.4. 


a.
    i. $\bar{x}$ =
    ii. $s_x$ =
    iii. $n$ =
    iv. $n - 1$ =

b. Define the Random Variable $X$, in words.

c. Define the Random Variable $\bar{X}$, in words.

d. Which distribution should you use for this problem? Explain
your choice.

e. Construct a 95% confidence interval for the population mean
length of time.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

f. What does it mean to be “95% confident” in this problem?

Exercise: 
Problem: 


Suppose that 14 children were surveyed to determine how long they 
had to use training wheels. It was revealed that they used them an 
average of 6 months with a sample standard deviation of 3 months. 
Assume that the underlying population distribution is normal. 


a.
    i. $\bar{x}$ =
    ii. $s_x$ =
    iii. $n$ =
    iv. $n - 1$ =

b. Define the Random Variable $X$, in words.

c. Define the Random Variable $\bar{X}$, in words.

d. Which distribution should you use for this problem? Explain
your choice.

e. Construct a 99% confidence interval for the population mean
length of time using training wheels.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

f. Why would the error bound change if the confidence level was
lowered to 90%?


Solution: 

a.
    i. 6
    ii. 3
    iii. 14
    iv. 13

b. the time for a child to remove his training wheels

c. the mean time for 14 children to remove their training wheels

d. $t_{13}$

e.
    i. CI: (3.58, 8.42)
    iii. EB = 2.42


Exercise: 


Problem: 


Insurance companies are interested in knowing the population percent 
of drivers who always buckle up before riding in a car. 


a. When designing a study to determine this population proportion,
what is the minimum number you would need to survey to be
95% confident that the population proportion is estimated to
within 0.03?

b. If it was later determined that it was important to be more than
95% confident and a new survey was commissioned, how would
that affect the minimum number you would need to survey?
Why?
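
Sample-size questions of this kind are commonly handled by solving the error bound formula for n, which gives n = z^2 * p' * q' / EB^2, rounded up. The sketch below assumes that formula and the conservative guess p' = 0.5 when no prior estimate is available; the function name and the `p_guess` parameter are ours for illustration.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size_proportion(error_bound, level, p_guess=0.5):
    """Smallest n so that the error bound for a proportion is at most error_bound.

    p_guess = 0.5 is an assumption: it maximizes p * (1 - p) and so gives
    the largest (safest) sample size.
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ceil(z ** 2 * p_guess * (1 - p_guess) / error_bound ** 2)


n_95 = min_sample_size_proportion(error_bound=0.03, level=0.95)
n_99 = min_sample_size_proportion(error_bound=0.03, level=0.99)
print(n_95, n_99)
```

Raising the confidence level raises z, so the required n grows; shrinking the error bound has the same effect because EB is squared in the denominator.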


Exercise: 


Problem: 


Suppose that the insurance companies did do a survey. They randomly 
surveyed 400 drivers and found that 320 claimed to always buckle up. 
We are interested in the population proportion of drivers who claim to 
always buckle up. 


a.
    i. $x$ =
    ii. $n$ =
    iii. $p'$ =

b. Define the Random Variables $X$ and $P'$, in words.

c. Which distribution should you use for this problem? Explain
your choice.

d. Construct a 95% confidence interval for the population
proportion that claim to always buckle up.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

e. If this survey were done by telephone, list 3 difficulties the
companies might have in obtaining random results.


Solution: 
a.
    i. 320
    ii. 400
    iii. 0.80

c. $N\left(0.80, \sqrt{\frac{(0.80)(0.20)}{400}}\right)$

d.
    i. CI: (0.76, 0.84)
    iii. EB = 0.04
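
The proportion interval above follows the same pattern as the intervals for a mean. This is a minimal sketch, assuming the normal approximation EB = z * sqrt(p'q'/n) used in the solution; the function name `proportion_interval` is ours for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(x, n, level):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = x / n                      # point estimate p'
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    error_bound = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - error_bound, p_hat + error_bound, error_bound


# 320 of 400 surveyed drivers claimed to always buckle up.
low, high, eb = proportion_interval(x=320, n=400, level=0.95)
print(f"CI: ({low:.2f}, {high:.2f}), EB = {eb:.2f}")
```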


Exercise: 


Problem: 


Unoccupied seats on flights cause airlines to lose revenue. Suppose a 
large airline wants to estimate its mean number of unoccupied seats 
per flight over the past year. To accomplish this, the records of 225 
flights are randomly selected and the number of unoccupied seats is 
noted for each of the sampled flights. The sample mean is 11.6 seats 
and the sample standard deviation is 4.1 seats. 


a.
    i. $\bar{x}$ =
    ii. $s_x$ =
    iii. $n$ =
    iv. $n - 1$ =

b. Define the Random Variables $X$ and $\bar{X}$, in words.

c. Which distribution should you use for this problem? Explain
your choice.

d. Construct a 92% confidence interval for the population mean
number of unoccupied seats per flight.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.


Exercise: 


Problem: 


According to a recent survey of 1200 people, 61% feel that the 
president is doing an acceptable job. We are interested in the 
population proportion of people who feel the president is doing an 
acceptable job. 


a. Define the Random Variables $X$ and $P'$, in words.

b. Which distribution should you use for this problem? Explain
your choice.

c. Construct a 90% confidence interval for the population
proportion of people who feel the president is doing an acceptable
job.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.


Solution: 


b. $N\left(0.61, \sqrt{\frac{(0.61)(0.39)}{1200}}\right)$

c.
    i. CI: (0.59, 0.63)
    iii. EB = 0.02


Exercise: 


Problem: 


A survey of the mean amount of cents off that coupons give was done 
by randomly surveying one coupon per page from the coupon sections 
of a recent San Jose Mercury News. The following data were 
collected: 20¢; 75¢; 50¢; 65¢; 30¢; 55¢; 40¢; 40¢; 30¢; 55¢; $1.50; 
40¢; 65¢; 40¢. Assume the underlying distribution is approximately 
normal. 


a.
    i. $\bar{x}$ =
    ii. $s_x$ =
    iii. $n$ =
    iv. $n - 1$ =

b. Define the Random Variables $X$ and $\bar{X}$, in words.

c. Which distribution should you use for this problem? Explain
your choice.

d. Construct a 95% confidence interval for the population mean
worth of coupons.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

e. If many random samples were taken of size 14, what percent of
the confidence intervals constructed should contain the population
mean worth of coupons? Explain why.


Exercise: 


Problem: 


An article regarding interracial dating and marriage recently appeared 
in the Washington Post. Of the 1709 randomly selected adults, 315 
identified themselves as Latinos, 323 identified themselves as blacks, 
254 identified themselves as Asians, and 779 identified themselves as 
whites. In this survey, 86% of blacks said that their families would 
welcome a white person into their families. Among Asians, 77% 
would welcome a white person into their families, 71% would 
welcome a Latino, and 66% would welcome a black person. 


a. We are interested in finding the 95% confidence interval for the
percent of all black families that would welcome a white person
into their families. Define the Random Variables $X$ and $P'$, in
words.

b. Which distribution should you use for this problem? Explain
your choice.

c. Construct a 95% confidence interval.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.


Solution: 


b. $N\left(0.86, \sqrt{\frac{(0.86)(0.14)}{323}}\right)$

c.
    i. CI: (0.823, 0.898)
    iii. EB = 0.038


Exercise: 


Problem: Refer to the problem above.


a. Construct three 95% confidence intervals.

    i. Percent of all Asians that would welcome a white person
    into their families.

    ii. Percent of all Asians that would welcome a Latino into
    their families.

    iii. Percent of all Asians that would welcome a black person
    into their families.

b. Even though the three point estimates are different, do any of the
confidence intervals overlap? Which?

c. For any intervals that do overlap, in words, what does this imply
about the significance of the differences in the true proportions?

d. For any intervals that do not overlap, in words, what does this
imply about the significance of the differences in the true
proportions?


Exercise: 
Problem: 
A camp director is interested in the mean number of letters each child
sends during his/her camp session. The population standard deviation
is known to be 2.5. A survey of 20 campers is taken. The mean from
the sample is 7.9 with a sample standard deviation of 2.8.


a.
    i. $\bar{x}$ =
    ii. $\sigma$ =
    iii. $s_x$ =
    iv. $n$ =
    v. $n - 1$ =

b. Define the Random Variables $X$ and $\bar{X}$, in words.

c. Which distribution should you use for this problem? Explain
your choice.

d. Construct a 90% confidence interval for the population mean
number of letters campers send home.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

e. What will happen to the error bound and confidence interval if
500 campers are surveyed? Why?


Solution: 
a.
    i. 7.9
    ii. 2.5
    iii. 2.8
    iv. 20
    v. 19

c. $N\left(7.9, \frac{2.5}{\sqrt{20}}\right)$

d.
    i. CI: (6.98, 8.82)
    iii. EB = 0.92

Exercise: 
Problem: 
Stanford University conducted a study of whether running is healthy
for men and women over age 50. During the first eight years of the
study, 1.5% of the 451 members of the 50-Plus Fitness Association
died. We are interested in the proportion of people over 50 who ran
and died in the same eight-year period.


a. Define the Random Variables $X$ and $P'$, in words.

b. Which distribution should you use for this problem? Explain
your choice.

c. Construct a 97% confidence interval for the population
proportion of people over 50 who ran and died in the same
eight-year period.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

d. Explain what a “97% confidence interval” means for this study.


Exercise: 


Problem: 


In a recent sample of 84 used cars sales costs, the sample mean was 
$6425 with a standard deviation of $3156. Assume the underlying 
distribution is approximately normal. 


a. Which distribution should you use for this problem? Explain
your choice.

b. Define the Random Variable $\bar{X}$, in words.

c. Construct a 95% confidence interval for the population mean
cost of a used car.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

d. Explain what a “95% confidence interval” means for this study.


Solution: 


a. $t_{83}$
b. mean cost of 84 used cars
c.
    i. CI: (5740.10, 7109.90)
    iii. EB = 684.90


Exercise: 


Problem: 


A telephone poll of 1000 adult Americans was reported in an issue of 
Time Magazine. One of the questions asked was “What is the main 
problem facing the country?” 20% answered “crime”. We are 
interested in the population proportion of adult Americans who feel 
that crime is the main problem. 


a. Define the Random Variables $X$ and $P'$, in words.

b. Which distribution should you use for this problem? Explain
your choice.

c. Construct a 95% confidence interval for the population
proportion of adult Americans who feel that crime is the main
problem.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

d. Suppose we want to lower the sampling error. What is one way
to accomplish that?

e. The sampling error given by Yankelovich Partners, Inc. (which
conducted the poll) is ± 3%. In 1-3 complete sentences, explain
what the ± 3% represents.


Exercise: 


Problem: 


Refer to the above problem. Another question in the poll was “[How 
much are] you worried about the quality of education in our schools?” 
63% responded “a lot”. We are interested in the population proportion 
of adult Americans who are worried a lot about the quality of 
education in our schools. 


1. Define the Random Variables $X$ and $P'$, in words.

2. Which distribution should you use for this problem? Explain your
choice.

3. Construct a 95% confidence interval for the population proportion
of adult Americans worried a lot about the quality of education in
our schools.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

4. The sampling error given by Yankelovich Partners, Inc. (which
conducted the poll) is ± 3%. In 1-3 complete sentences, explain
what the ± 3% represents.


Solution: 
2. $N\left(0.63, \sqrt{\frac{(0.63)(0.37)}{1000}}\right)$

3.
    i. CI: (0.60, 0.66)
    iii. EB = 0.03


Exercise: 


Problem: 


Six different national brands of chocolate chip cookies were randomly 
selected at the supermarket. The grams of fat per serving are as 
follows: 8; 8; 10; 7; 9; 9. Assume the underlying distribution is 
approximately normal. 


a. Calculate a 90% confidence interval for the population mean
grams of fat per serving of chocolate chip cookies sold in
supermarkets.

    i. State the confidence interval.
    ii. Sketch the graph.
    iii. Calculate the error bound.

b. If you wanted a smaller error bound while keeping the same
level of confidence, what should have been changed in the study
before it was done?

c. Go to the store and record the grams of fat per serving of six
brands of chocolate chip cookies.

d. Calculate the mean.

e. Is the mean within the interval you calculated in part (a)? Did
you expect it to be? Why or why not?


Exercise: 
Problem: 
A confidence interval for a proportion is given to be (-0.22, 0.34).
Why doesn’t the lower limit of the confidence interval make practical
sense? How should it be changed? Why?


Try these multiple choice questions. 
The next three problems refer to the following: According to a Field
Poll, 79% of California adults (actual results are 400 out of 506 surveyed)
feel that “education and our schools” is one of the top issues facing
California. We wish to construct a 90% confidence interval for the true
proportion of California adults who feel that education and the schools is
one of the top issues facing California. (Source:
http://field.com/fieldpollonline/subscribers/)

Exercise: 


Problem: A point estimate for the true population proportion is:


A. 0.90
B. 1.27
C. 0.79
D. 400


Solution: 


C 


Exercise: 


Problem: A 90% confidence interval for the population proportion is:


A. (0.761, 0.820)
B. (0.125, 0.188)
C. (0.755, 0.826)
D. (0.130, 0.183)


Solution: 


A 


Exercise: 


Problem: The error bound is approximately


A. 1.581
B. 0.791
C. 0.059
D. 0.030


Solution: 


D 


The next two problems refer to the following: 


A quality control specialist for a restaurant chain takes a random sample of 
size 12 to check the amount of soda served in the 16 oz. serving size. The 
sample mean is 13.30 with a sample standard deviation of 1.55. Assume the 
underlying population is normally distributed. 

Exercise: 


Problem: 


Find the 95% Confidence Interval for the true population mean for the 
amount of soda served. 


A. (12.42, 14.18)
B. (12.32, 14.29)
C. (12.50, 14.10)
D. Impossible to determine


Solution: 


B 


Exercise: 


Problem: What is the error bound? 


A. 0.87
B. 1.98
C. 0.99
D. 1.74


Solution: 


C 
Exercise: 


Problem: 


What is meant by the term “90% confident” when constructing a 
confidence interval for a mean? 


A. If we took repeated samples, approximately 90% of the samples
would produce the same confidence interval.

B. If we took repeated samples, approximately 90% of the
confidence intervals calculated from those samples would contain
the sample mean.

C. If we took repeated samples, approximately 90% of the
confidence intervals calculated from those samples would contain
the true value of the population mean.

D. If we took repeated samples, the sample mean would equal the
population mean in approximately 90% of the samples.


Solution: 


C 


The next two problems refer to the following: 


Five hundred and eleven (511) homes in a certain southern California 
community are randomly surveyed to determine if they meet minimal 
earthquake preparedness recommendations. One hundred seventy-three 
(173) of the homes surveyed met the minimum recommendations for 
earthquake preparedness and 338 did not. 

Exercise: 


Problem: 


Find the Confidence Interval at the 90% Confidence Level for the true 
population proportion of southern California community homes 
meeting at least the minimum recommendations for earthquake 
preparedness. 


A. (0.2975, 0.3796)
B. (0.6270, 0.6959)
C. (0.3041, 0.3730)
D. (0.6204, 0.7025)


Solution: 


C 
Exercise: 


Problem: 


The point estimate for the population proportion of homes that do not 
meet the minimum recommendations for earthquake preparedness is: 


A. 0.6614
B. 0.3386
C. 173
D. 338


Solution: 


A 


Review 


The next three problems refer to the following situation: Suppose that a 
sample of 15 randomly chosen people were put on a special weight loss 
diet. The amount of weight lost, in pounds, follows an unknown distribution 
with mean equal to 12 pounds and standard deviation equal to 3 pounds. 
Assume that the distribution for the weight loss is normal. 

Exercise: 


Problem: 


To find the probability that the mean amount of weight lost by 15 
people is no more than 14 pounds, the random variable should be: 


A. The number of people who lost weight on the special weight
loss diet

B. The number of people who were on the diet

C. The mean amount of weight lost by 15 people on the special
weight loss diet

D. The total amount of weight lost by 15 people on the special
weight loss diet


Solution: 


C 


Exercise: 


Problem: Find the probability asked for in the previous problem.


Solution: 


0.9951 


Exercise: 


Problem: 


Find the 90th percentile for the mean amount of weight lost by 15 
people. 


Solution: 


12.99
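
Both answers for the weight loss problems come from the sampling distribution of the mean, which here is normal with mean 12 and standard deviation 3 / sqrt(15). This is a minimal sketch using Python's standard library; the variable names are ours for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Sampling distribution of the mean weight loss for n = 15 people.
x_bar = NormalDist(mu=12, sigma=3 / sqrt(15))

p_at_most_14 = x_bar.cdf(14)          # P(X-bar <= 14)
percentile_90 = x_bar.inv_cdf(0.90)   # 90th percentile of X-bar

print(f"P(X-bar <= 14) = {p_at_most_14:.4f}")
print(f"90th percentile = {percentile_90:.2f}")
```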


The next three questions refer to the following situation: The time of 
occurrence of the first accident during rush-hour traffic at a major 
intersection is uniformly distributed between the three hour interval 4 p.m. 
to 7 p.m. Let X = the amount of time (hours) it takes for the first accident to 
occur. 


• So, if an accident occurs at 4 p.m., the amount of time, in hours, it took
for the accident to occur is ____.

• $\mu$ = ____

• $\sigma^2$ = ____


Exercise: 


Problem: 


What is the probability that the time of occurrence is within the first 
half-hour or the last hour of the period from 4 to 7 p.m.? 


A. Cannot be determined from the information given
B.
C.
D.


Solution: 


C


Exercise: 


Problem: The 20th percentile occurs after how many hours?


A. 0.20
B. 0.60
C. 0.50
D. 1


Solution: 


B 
Exercise: 
Problem: 
Assume Ramon has kept track of the times for the first accidents to
occur for 40 different days. Let $C$ = the total cumulative time. Then $C$
follows which distribution?


A. $U(0, 3)$
B. $Exp\left(\frac{1}{3}\right)$
C. $N(60, 5.477)$
D. $N(1.5, 0.01875)$


Solution: 


C 
Exercise: 


Problem: 


Using the information in question #6, find the probability that the total 
time for all first accidents to occur is more than 43 hours. 


Solution: 


0.9990 
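
The distribution of the total and the probability above can be checked numerically. This is a minimal sketch, assuming the Central Limit Theorem for sums: the total of n independent uniform times is approximately normal with mean n * mu and standard deviation sqrt(n) * sigma. The variable names are ours for illustration.

```python
from math import sqrt
from statistics import NormalDist

a, b = 0.0, 3.0          # X ~ U(0, 3): hours after 4 p.m.
n = 40                   # number of days observed

mu = (a + b) / 2                 # mean of one uniform time
sigma = (b - a) / sqrt(12)       # standard deviation of one uniform time

# Approximate distribution of the total time over n days.
total = NormalDist(mu=n * mu, sigma=sqrt(n) * sigma)
p_more_than_43 = 1 - total.cdf(43)

print(f"Total ~ N({total.mean:.0f}, {total.stdev:.3f})")
print(f"P(total > 43) = {p_more_than_43:.4f}")
```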


The next two questions refer to the following situation: The length of 
time a parent must wait for his children to clean their rooms is uniformly 
distributed in the time interval from 1 to 15 days. 

Exercise: 


Problem: 


How long must a parent expect to wait for his children to clean their 
rooms? 


A. 8 days
B. 3 days
C. 14 days
D. 6 days


Solution: 


A 
Exercise: 


Problem: 


What is the probability that a parent will wait more than 6 days given 
that the parent has already waited more than 3 days? 


A. 0.5174
B. 0.0174
C. 0.7500
D. 0.2143


Solution: 


C
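
For a uniform distribution, conditioning on "already waited more than 3 days" simply shortens the interval that remains. This is a minimal sketch of both room-cleaning calculations, assuming X ~ U(1, 15); the function names are ours for illustration.

```python
def uniform_mean(a, b):
    """Expected value of X ~ U(a, b)."""
    return (a + b) / 2


def uniform_prob_greater_given(a, b, x, given):
    """P(X > x | X > given) for X ~ U(a, b), assuming a <= given <= x <= b."""
    # Lengths of the surviving intervals; the common density 1 / (b - a) cancels.
    return (b - x) / (b - given)


expected_wait = uniform_mean(1, 15)
p_more_than_6 = uniform_prob_greater_given(1, 15, x=6, given=3)
print(expected_wait, p_more_than_6)
```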


The next five problems refer to the following study: Twenty percent of
the students at a local community college live within five miles of the
campus. Thirty percent of the students at the same community college
receive some kind of financial aid. Of those who live within five miles of
the campus, 75% receive some kind of financial aid.

Exercise: 


Problem: 


Find the probability that a randomly chosen student at the local 
community college does not live within five miles of the campus. 


A. 80%
B. 20%
C. 30%
D. Cannot be determined


Solution: 


A 
Exercise: 
Problem: 
Find the probability that a randomly chosen student at the local
community college lives within five miles of the campus or receives
some kind of financial aid.


A. 50%
B. 35%
C. 27.5%
D. 75%


Solution: 


B 


Exercise: 


Problem: 


Based upon the above information, are living in student housing within 
five miles of the campus and receiving some kind of financial aid 
mutually exclusive? 


A. Yes
B. No
C. Cannot be determined


Solution: 


B 
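
The last two answers both rest on the probability that a student lives nearby and receives aid. This is a minimal sketch using the multiplication and addition rules with the percentages from the study; the variable names are ours for illustration.

```python
from math import isclose

p_near = 0.20             # lives within five miles of campus
p_aid = 0.30              # receives some kind of financial aid
p_aid_given_near = 0.75   # receives aid, given lives within five miles

# Multiplication rule: P(near AND aid) = P(near) * P(aid | near)
p_both = p_near * p_aid_given_near

# Addition rule: P(near OR aid) = P(near) + P(aid) - P(near AND aid)
p_either = p_near + p_aid - p_both

# Mutually exclusive events cannot happen together, so P(near AND aid) would be 0.
mutually_exclusive = isclose(p_both, 0.0, abs_tol=1e-12)

print(round(p_both, 4), round(p_either, 4), mutually_exclusive)
```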
Exercise: 


Problem: 
The interest rate charged on the financial aid is ________ data.


A. quantitative discrete
B. quantitative continuous
C. qualitative discrete
D. qualitative


Solution: 


B 
Exercise: 


Problem: 


What follows is information about the students who receive financial 
aid at the local community college. 


• 1st quartile = $250
• 2nd quartile = $700
• 3rd quartile = $1200


(These amounts are for the school year.) If a sample of 200 students is
taken, how many are expected to receive $250 or more?


A. 50
B. 250
C. 150
D. Cannot be determined


Solution:


C. 150


The next two problems refer to the following information: P(A) = 0.2, 
P(B) = 0.3, A and B are independent events. 
Exercise: 


Problem: P(A AND B) =

A. 0.5
B. 0.6
C. 0
D. 0.06

Solution: 

D 

Exercise: 


Problem: P(A OR B) = 


A. 0.56
B. 0.5
C. 0.44
D. 1


Solution:


C
Exercise: 
Problem: 


If H and D are mutually exclusive events, P(H) = 0.25,
P(D) = 0.15, then P(H|D) =


A. 1
B. 0
C. 0.40
D. 0.0375


Solution: 


B 


Confidence Interval Lab 4 
This lab incorporates both Lab 1 and Lab 2 with Minitab.


Confidence Interval Lab 


Name: 


Student Learning Outcomes: 
• The student will calculate the 90% confidence intervals.
• The student will interpret confidence intervals.
• The student will examine the effects that changing conditions have on the confidence
interval.
• The student will examine the relationship between the confidence level and the percent of
constructed intervals that contain the population average.


Part I 


Collect the Data 


Check the Real Estate section in your local newspaper or website. (Note: many papers only list
them one day per week. Also, we will assume that homes come up for sale randomly.) Record the
sales prices for 36 randomly selected homes recently listed in the county and indicate whether
you found them in the paper or on the website. Include the reference.


Complete the table: 


Describe the Data 


1. Compute the following: (include the session window) 


2. Define the Random Variable X in words: 


3. State the estimated distribution to use. Use both words and symbols. 


Find the Confidence Interval 


1. Using Minitab, calculate the 90% confidence interval and determine the error bound. (include 
in the session window) 


a. Confidence Interval: 

b. Error Bound: 

2. How much area is in both tails (combined)? $\alpha$ =
3. How much area is in each tail? $\frac{\alpha}{2}$ =


4. Using Minitab, create the graph for the 90% confidence interval created above. (Graph >
Probability Distribution Plot > View Probability) Include the graph with this lab.


5. Some students think that a 90% confidence interval contains 90% of the data. Use the list of 
data on the first page and count how many of the data values lie within the confidence interval. 
What percent is this? Is this percent close to 90%? Explain why this percent should or should not 
be close to 90%. 


6. How many house prices would be needed in the sample to ensure that the error was no more 
than $2000 for the 90% confidence interval? 


Describe the Confidence Interval 


1. In two to three complete sentences, explain what a Confidence Interval means (in general), as 
if you were talking to someone who has not taken statistics. 


2. In one to two complete sentences, explain what this Confidence Interval means for this 
particular study. 


Use the Data to Construct Confidence Intervals 


1. Using the above information, construct a confidence interval for each confidence level given. 
(include the session window information) 


Confidence Level    Error Bound    Confidence Interval
85%
95%
98%
99%


2. What happens to the EBM as the confidence level increases? Does the width of the
confidence interval increase or decrease? Explain why this happens.


Effect of an Outlier


Suppose one of the values was input incorrectly. Choose one data value and increase the amount by
adding two extra zeroes. (include in the session window)
1. Calculate the 90% confidence interval: 


2. Calculate the Error Bound: 


3. How does the outlier affect the width of the confidence interval? Use complete sentences. 


Part II 


Heights of 100 Women (in Inches) 


59.4 71.6 69.3 65.0 62.9 66.5 63.8 62.9 63.0 63.9
68.7 65.5 61.8 60.6 69.8 60.0 64.9 66.1 61.3 [illegible]
64.1 [illegible] 64.9 62.4 61.5 64.3 62.9 60.6 63.8 58.8
62.9 63.1 62.2 58.7 64.7 66.0 64.3 65.0 64.1 61.1
65.3 64.6 60.8 65.5 62.3 65.5 64.7 58.8 [illegible] 58.5
63.4 69.2 65.9 62.2 64.0 66.4 61.2 60.4 58.7 66.7
61.7 61.9 66.8 63.5 64.9 60.5 59.2 66.1 60.0 67.5
[illegible] 69.6 60.6 60.9 65.7 64.7 61.4 64.9 58.1 63.2
67.2 58.7 65.6 63.3 62.5 65.4 62.0 66.9 62.5 56.6
66.5 63.4 63.8 66.3 70.9 60.2 63.5 [illegible] 62.4 67.7


1. Listed above are the heights of 100 women. These are in the student data file in the lab. Use
Minitab to randomly select 10 data values. (Calc > Random data > data from a column)


2. Calculate the sample mean and sample standard deviation and record your answer below.
Assuming that the population standard deviation is known to be 3.3, construct a 90% confidence
interval for your sample of 10 values. Write the confidence interval you obtained below.


3. Repeat #1 and #2 nine more times, and record the 90% confidence interval for each of your
samples. (Note: since you are randomly selecting the 10 data values, you may use the same one
in more than one sample. Include this portion of your session window.)


Sample 1:  $\bar{x}$ = ____  $s$ = ____  CI = ____
Sample 2:  $\bar{x}$ = ____  $s$ = ____  CI = ____
Sample 3:  $\bar{x}$ = ____  $s$ = ____  CI = ____
Sample 4:  $\bar{x}$ = ____  $s$ = ____  CI = ____
Sample 5:  $\bar{x}$ = ____  $s$ = ____  CI = ____
Sample 6:  $\bar{x}$ = ____  $s$ = ____  CI = ____
Sample 7:  $\bar{x}$ = ____  $s$ = ____  CI = ____
Sample 8:  $\bar{x}$ = ____  $s$ = ____  CI = ____
Sample 9:  $\bar{x}$ = ____  $s$ = ____  CI = ____
Sample 10: $\bar{x}$ = ____  $s$ = ____  CI = ____


Discussion Questions 

1. The actual population mean for the 100 heights given above is 1 = 63.3. Using the samples 
above, for how many intervals does the value of 1 lie between the endpoints of the confidence 
interval? . 

2. What percentage of the total number of confidence intervals generated contain the mean 1? 


3. Is the percent of confidence intervals that contain the population mean pt close to 90%? 


4. Suppose we had generated 100 confidence intervals. What do you think would happen to the 
percent of confidence intervals that contained the population mean? Use 2 — 3 complete 
sentences. 


5. When we construct a 90% confidence interval, we say that we are 90% confident that the true
population mean lies within the confidence interval. Using complete sentences, explain what we
mean by this phrase in terms a non-statistician would understand.


6. Some students think that a 90% confidence interval contains 90% of the data. Use the first
confidence interval calculated and count how many of the 100 data values lie within that
confidence interval. What percent is this? Is this percent close to 90%? Use 3 to 4 sentences to
explain why it should or should not be close to 90%.


Hypothesis Testing: Single Mean and Single Proportion 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Differentiate between Type I and Type II Errors.

• Describe hypothesis testing in general and in practice.

• Conduct and interpret hypothesis tests for a single population mean,
population standard deviation known.

• Conduct and interpret hypothesis tests for a single population mean,
population standard deviation unknown.

• Conduct and interpret hypothesis tests for a single population
proportion.


Introduction 


One job of a statistician is to make statistical inferences about populations
based on samples taken from the population. Confidence intervals are one
way to estimate a population parameter. Another way to make a statistical
inference is to make a decision about a parameter. For instance, a car dealer
advertises that its new small truck gets 35 miles per gallon, on the average.
A tutoring service claims that its method of tutoring helps 90% of its
students get an A or a B. A company says that women managers in their
company earn an average of $60,000 per year.


A statistician will make a decision about these claims. This process is called
"hypothesis testing." A hypothesis test involves collecting data from a
sample and evaluating the data. Then, the statistician makes a decision as to
whether or not there is sufficient evidence, based upon analyses of the data,
to reject the null hypothesis.


In this chapter, you will conduct hypothesis tests on single means and single 
proportions. You will also learn about the errors associated with these tests. 


Hypothesis testing consists of two contradictory hypotheses or statements, a
decision based on the data, and a conclusion. To perform a hypothesis test, a
statistician will:


1. Set up two contradictory hypotheses.

2. Collect sample data (in homework problems, the data or summary
statistics will be given to you).

3. Determine the correct distribution to perform the hypothesis test. 

4. Analyze sample data by performing the calculations that ultimately 
will allow you to reject or fail to reject the null hypothesis. 

5. Make a decision and write a meaningful conclusion. 


Note: To do the hypothesis test homework problems for this chapter and
later chapters, make copies of the appropriate special solution sheets. See 
the Table of Contents topic "Solution Sheets". 


Glossary 


Confidence Interval (CI)
An interval estimate for an unknown population parameter. This
depends on:

• The desired confidence level.

• Information that is known about the distribution (for example,
known standard deviation).

• The sample and its size.


Hypothesis Testing 
Based on sample evidence, a procedure to determine whether the 
hypothesis stated is a reasonable statement and cannot be rejected, or 
is unreasonable and should be rejected. 


Null and Alternate Hypotheses 


The actual test begins by considering two hypotheses. They are called the 
null hypothesis and the alternate hypothesis. These hypotheses contain 
opposing viewpoints. 


Ho: The null hypothesis: It is a statement about the population that will be
assumed to be true unless it can be shown to be incorrect beyond a
reasonable doubt.


Ha: The alternate hypothesis: It is a claim about the population that is
contradictory to Ho and what we conclude when we reject Ho.


Example:

Ho: No more than 30% of the registered voters in Santa Clara County
voted in the primary election.

Ha: More than 30% of the registered voters in Santa Clara County voted in
the primary election.


Example: 

We want to test whether the mean grade point average in American 
colleges is different from 2.0 (out of 4.0). 

Ho: μ = 2.0    Ha: μ ≠ 2.0


Example: 

We want to test if college students take less than five years to graduate 
from college, on the average. 

Ho: μ ≥ 5    Ha: μ < 5


Example: 


In an issue of U.S. News and World Report, an article on school
standards stated that about half of all students in France, Germany, and
Israel take advanced placement exams and a third pass. The same article
stated that 6.6% of U.S. students take advanced placement exams and
4.4% pass. Test if the percentage of U.S. students who take advanced
placement exams is more than 6.6%.

Ho: p = 0.066    Ha: p > 0.066


Since the null and alternate hypotheses are contradictory, you must examine 
evidence to decide if you have enough evidence to reject the null hypothesis 
or not. The evidence is in the form of sample data. 


After you have determined which hypothesis the sample supports, you
make a decision. There are two options for a decision. They are "reject Ho"
if the sample information favors the alternate hypothesis or "do not reject
Ho" or "fail to reject Ho" if the sample information is insufficient to reject
the null hypothesis.


Mathematical Symbols Used in Ho and Ha:


Ho                              Ha
equal (=)                       not equal (≠) or greater than (>) or less than (<)
greater than or equal to (≥)    less than (<)
less than or equal to (≤)       more than (>)


Note: Ho always has a symbol with an equal in it. Ha never has a symbol
with an equal in it. The choice of symbol depends on the wording of the
hypothesis test. However, be aware that many researchers (including one of
the co-authors in research work) use = in the Null Hypothesis, even with
> or < as the symbol in the Alternate Hypothesis. This practice is
acceptable because we only make the decision to reject or not reject the
Null Hypothesis.


Optional Collaborative Classroom Activity 


Bring to class a newspaper, some news magazines, and some Internet
articles. In groups, find articles from which your group can write null and
alternate hypotheses. Discuss your hypotheses with the rest of the class.


Glossary 


Hypothesis 
A statement about the value of a population parameter. In case of two 
hypotheses, the statement assumed to be true is called the null 
hypothesis (notation Ho) and the contradictory statement is called the 
alternate hypothesis (notation Ha).


Outcomes and the Type I and Type II Errors 


When you perform a hypothesis test, there are four possible outcomes
depending on the actual truth (or falseness) of the null hypothesis Ho and
the decision to reject or not. The outcomes are summarized in the following
table:


ACTION              Ho IS ACTUALLY True    Ho IS ACTUALLY False
Do not reject Ho    Correct Outcome        Type II Error
Reject Ho           Type I Error           Correct Outcome


The four possible outcomes in the table are: 


• The decision is to not reject Ho when, in fact, Ho is true (correct
decision).

• The decision is to reject Ho when, in fact, Ho is true (incorrect
decision known as a Type I error).

• The decision is to not reject Ho when, in fact, Ho is false (incorrect
decision known as a Type II error).

• The decision is to reject Ho when, in fact, Ho is false (correct
decision whose probability is called the Power of the Test).


Each of the errors occurs with a particular probability. The Greek letters α
and β represent the probabilities.


α = probability of a Type I error = P(Type I error) = probability of
rejecting the null hypothesis when the null hypothesis is true.


β = probability of a Type II error = P(Type II error) = probability of not
rejecting the null hypothesis when the null hypothesis is false.


α and β should be as small as possible because they are probabilities of
errors. They are rarely 0.


The Power of the Test is 1 − β. Ideally, we want a high power that is as
close to 1 as possible. Increasing the sample size can increase the Power of
the Test.
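
To see how sample size affects power, here is a minimal Python sketch for one specific case: a right-tailed test of a mean with known population standard deviation. All numbers (a null mean of 15, a true mean of 15.3, a standard deviation of 0.5) and the function name power_right_tailed are hypothetical choices for illustration only.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed(mu0, mu_true, sigma, n, alpha):
    """Power (1 - beta) of a right-tailed z-test when the true mean is mu_true."""
    se = sigma / sqrt(n)
    # Smallest sample mean that leads to rejecting Ho at level alpha.
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability that the sample mean lands beyond the cutoff under the true mean.
    return 1 - NormalDist(mu_true, se).cdf(cutoff)

small_n = power_right_tailed(mu0=15, mu_true=15.3, sigma=0.5, n=5, alpha=0.05)
large_n = power_right_tailed(mu0=15, mu_true=15.3, sigma=0.5, n=20, alpha=0.05)
print(f"power with n = 5: {small_n:.3f}; power with n = 20: {large_n:.3f}")
```

With α held fixed, the larger sample gives the larger power, which is the same as a smaller β.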


The following are examples of Type I and Type II errors. 


Example: 

Suppose the null hypothesis, Ho, is: Frank's rock climbing equipment is 
safe. 

Type I error: Frank thinks that his rock climbing equipment may not be 
safe when, in fact, it really is safe. Type II error: Frank thinks that his 
rock climbing equipment may be safe when, in fact, it is not safe. 

α = probability that Frank thinks his rock climbing equipment may not be
safe when, in fact, it really is safe. β = probability that Frank thinks his
rock climbing equipment may be safe when, in fact, it is not safe.

Notice that, in this case, the error with the greater consequence is the Type 
II error. (If Frank thinks his rock climbing equipment is safe, he will go 
ahead and use it.) 


Example: 

Suppose the null hypothesis, Ho, is: The victim of an automobile accident
is alive when he arrives at the emergency room of a hospital.

Type I error: The emergency crew thinks that the victim is dead when, in
fact, the victim is alive. Type II error: The emergency crew does not
know if the victim is alive when, in fact, the victim is dead.

α = probability that the emergency crew thinks the victim is dead when,
in fact, he is really alive = P(Type I error). β = probability that the
emergency crew does not know if the victim is alive when, in fact, the
victim is dead = P(Type II error).

The error with the greater consequence is the Type I error. (If the 
emergency crew thinks the victim is dead, they will not treat him.) 


Glossary 


Type 1 Error
The decision is to reject the Null hypothesis when, in fact, the Null
hypothesis is true.


Type 2 Error
The decision is to not reject the Null hypothesis when, in fact, the Null
hypothesis is false.


Distribution Needed for Hypothesis Testing 


Earlier in the course, we discussed sampling distributions. Particular
distributions are associated with hypothesis testing. Perform tests of a
population mean using a normal distribution or a Student's-t
distribution. (Remember, use a Student's-t distribution when the population
standard deviation is unknown and the distribution of the sample mean is
approximately normal.) In this chapter we perform tests of a population
proportion using a normal distribution (usually the sample size n is
large).


If you are testing a single population mean, the distribution for the test is 
for means: 


$\bar{X} \sim N\left(\mu_X, \frac{\sigma_X}{\sqrt{n}}\right)$ or $t_{df}$


The population parameter is μ. The estimated value (point estimate) for μ is
x̄, the sample mean.


If you are testing a single population proportion, the distribution for the 
test is for proportions or percentages: 


$P' \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$


The population parameter is p. The estimated value (point estimate) for p is
p′. p′ = x/n, where x is the number of successes and n is the sample size.


Glossary 


Normal Distribution 
A continuous random variable (RV) with pdf
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$, where μ is the mean of the distribution and
σ is the standard deviation. Notation: X ~ N(μ, σ). If μ = 0 and
σ = 1, the RV is called the standard normal distribution.


Standard Deviation 
A number that is equal to the square root of the variance and measures 
how far data values are from their mean. Notation: s for sample 
standard deviation and σ for population standard deviation.


Student's-t Distribution
Investigated and reported by William S. Gosset in 1908 and published
under the pseudonym Student. The major characteristics of the random
variable (RV) are:


• It is continuous and assumes any real values.

• The pdf is symmetrical about its mean of zero. However, it is
more spread out and flatter at the apex than the normal
distribution.

• It approaches the standard normal distribution as n gets larger.

• There is a "family" of t distributions: every representative of the
family is completely defined by the number of degrees of
freedom, which is one less than the number of data values.


Assumption 


When you perform a hypothesis test of a single population mean μ using
a Student's-t distribution (often called a t-test), there are fundamental 
assumptions that need to be met in order for the test to work properly. Your 
data should be a simple random sample that comes from a population that 
is approximately normally distributed. You use the sample standard 
deviation to approximate the population standard deviation. (Note that if 
the sample size is sufficiently large, a t-test will work even if the population 
is not approximately normally distributed). 


When you perform a hypothesis test of a single population mean μ using
a normal distribution (often called a z-test), you take a simple random 
sample from the population. The population you are testing is normally 
distributed or your sample size is sufficiently large. You know the value of 
the population standard deviation. 


When you perform a hypothesis test of a single population proportion p,
you take a simple random sample from the population. You must meet the
conditions for a binomial distribution: there is a certain number
n of independent trials, the outcomes of any trial are success or failure, and
each trial has the same probability of a success p. The shape of the binomial
distribution needs to be similar to the shape of the normal distribution. To
ensure this, the quantities np and nq must both be greater than five (np > 5
and nq > 5). Then the binomial distribution of the sample (estimated)
proportion can be approximated by the normal distribution with μ = p and
$\sigma = \sqrt{\frac{pq}{n}}$. Remember that q = 1 − p.
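
The conditions np > 5 and nq > 5 are easy to check mechanically. The following is a minimal Python sketch; the function name normal_approx_for_proportion and the sample sizes used are hypothetical and chosen only to show one case that passes and one that fails.

```python
from math import sqrt

def normal_approx_for_proportion(n, p):
    """Check np > 5 and nq > 5; return (ok, mean, standard deviation) for P'."""
    q = 1 - p
    ok = n * p > 5 and n * q > 5
    return ok, p, sqrt(p * q / n)

# Hypothetical cases: p = 0.2 with a large and a small sample.
ok_large, mu_large, sigma_large = normal_approx_for_proportion(n=100, p=0.2)
ok_small, mu_small, sigma_small = normal_approx_for_proportion(n=20, p=0.2)
print(ok_large, mu_large, round(sigma_large, 4))
print(ok_small)
```

With n = 20 and p = 0.2, np is only 4, so the normal approximation would not be justified under this rule.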


Glossary 


Binomial Distribution 
A discrete random variable (RV) which arises from Bernoulli trials.
There are a fixed number, n, of independent trials. "Independent"
means that the result of any trial (for example, trial 1) does not affect
the results of the following trials, and all trials are conducted under the
same conditions. Under these circumstances the binomial RV X is
defined as the number of successes in n trials. The notation is: X ~
B(n, p). The mean is μ = np and the standard deviation is $\sigma = \sqrt{npq}$.
The probability of exactly x successes in n trials is
$P(X = x) = \binom{n}{x} p^x q^{n-x}$.


Normal Distribution 
A continuous random variable (RV) with pdf
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$, where μ is the mean of the distribution and
σ is the standard deviation. Notation: X ~ N(μ, σ). If μ = 0 and
σ = 1, the RV is called the standard normal distribution.


Standard Deviation 
A number that is equal to the square root of the variance and measures 
how far data values are from their mean. Notation: s for sample 
standard deviation and σ for population standard deviation.


Student's-t Distribution
Investigated and reported by William S. Gosset in 1908 and published
under the pseudonym Student. The major characteristics of the random
variable (RV) are:


• It is continuous and assumes any real values.

• The pdf is symmetrical about its mean of zero. However, it is
more spread out and flatter at the apex than the normal
distribution.

• It approaches the standard normal distribution as n gets larger.

• There is a "family" of t distributions: every representative of the
family is completely defined by the number of degrees of
freedom, which is one less than the number of data values.


Rare Events 


Suppose you make an assumption about a property of the population (this 
assumption is the null hypothesis). Then you gather sample data randomly. 
If the sample has properties that would be very unlikely to occur if the 
assumption is true, then you would conclude that your assumption about the 
population is probably incorrect. (Remember that your assumption is just an 
assumption - it is not a fact and it may or may not be true. But your sample 
data are real and the data are showing you a fact that seems to contradict 
your assumption.)


For example, Didi and Ali are at a birthday party of a very wealthy friend. 
They hurry to be first in line to grab a prize from a tall basket that they 
cannot see inside because they will be blindfolded. There are 200 plastic 
bubbles in the basket and Didi and Ali have been told that there is only one 
with a $100 bill. Didi is the first person to reach into the basket and pull out 
a bubble. Her bubble contains a $100 bill. The probability of this happening
is 1/200 = 0.005. Because this is so unlikely, Ali is hoping that what the two
of them were told is wrong and there are more $100 bills in the basket. A
"rare event" has occurred (Didi getting the $100 bill) so Ali doubts the 
assumption about only one $100 bill being in the basket. 


Glossary 


Hypothesis 
A statement about the value of a population parameter. In case of two 
hypotheses, the statement assumed to be true is called the null 
hypothesis (notation Ho) and the contradictory statement is called the 
alternate hypothesis (notation Ha).


Using the Sample to Support One of the Hypotheses 


Use the sample data to calculate the actual probability of getting the test
result, called the p-value. The p-value is the probability that, if the null
hypothesis is true, the results from another randomly selected sample
will be as extreme as or more extreme than the results obtained from the
given sample.


A large p-value calculated from the data indicates that we should fail to 
reject the null hypothesis. The smaller the p-value, the more unlikely the 
outcome, and the stronger the evidence is against the null hypothesis. We 
would reject the null hypothesis if the evidence is strongly against it. 


Draw a graph that shows the p-value. The hypothesis test is easier to 
perform if you use a graph because you see the problem more clearly. 


Example: 
(to illustrate the p-value) 
Suppose a baker claims that his bread height is more than 15 cm, on the
average. Several of his customers do not believe him. To persuade his
customers that he is right, the baker decides to do a hypothesis test. He
bakes 10 loaves of bread. The mean height of the sample loaves is 17 cm.
The baker knows from baking hundreds of loaves of bread that the
standard deviation for the height is 0.5 cm. and the distribution of heights
is normal.
The null hypothesis could be Ho: μ ≤ 15. The alternate hypothesis is Ha:
μ > 15.
The words "is more than" translate as a ">", so "μ > 15" goes into the
alternate hypothesis. The null hypothesis must contradict the alternate
hypothesis.
Since σ is known (σ = 0.5 cm.), the distribution for the sample mean is
known to be normal with mean μ = 15 and standard deviation
σ/√n = 0.5/√10 ≈ 0.16.
Suppose the null hypothesis is true (the mean height of the loaves is no
more than 15 cm). Then is the mean height (17 cm) calculated from the
sample unexpectedly large? The hypothesis test works by asking the
question how unlikely the sample mean would be if the null hypothesis
were true. The graph shows how far out the sample mean is on the normal
curve. The p-value is the probability that, if we were to take other samples,
any other sample mean would fall at least as far out as 17 cm.

The p-value, then, is the probability that a sample mean is the same or 
greater than 17 cm. when the population mean is, in fact, 15 cm. We 
can calculate this probability using the normal distribution for means from 
Chapter 7. 


[Figure: normal curve centered at 15, with the sample mean of 17 far out in
the right tail; the p-value is approximately 0.]


p-value = P(x̄ > 17), which is approximately 0.

A p-value of approximately 0 tells us that it is highly unlikely that a loaf of 
bread rises no more than 15 cm, on the average. That is, almost 0% of all 
loaves of bread would be at least as high as 17 cm. purely by CHANCE 
had the population mean height really been 15 cm. Because the outcome of 
17 cm. is so unlikely (meaning it is happening NOT by chance alone), 
we conclude that the evidence is strongly against the null hypothesis (the 
mean height is at most 15 cm.). There is sufficient evidence that the true 
mean height for the population of the baker's loaves of bread is greater than 
15 cm. 
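
The baker's p-value can be checked with a short Python sketch. It assumes, as the example does, that the sample mean is normal with mean 15 and standard deviation 0.5/√10; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 15       # mean height under the null hypothesis, in cm
sigma = 0.5    # known population standard deviation, in cm
n = 10         # number of loaves baked
x_bar = 17     # observed sample mean, in cm

se = sigma / sqrt(n)                # standard deviation of the sample mean
z = (x_bar - mu0) / se              # how many standard errors 17 is above 15
p_value = 1 - NormalDist().cdf(z)   # right-tail area, since Ha is mu > 15
print(f"z = {z:.2f}, p-value = {p_value:.6f}")
```

The sample mean sits more than 12 standard errors above 15, so the right-tail area is indistinguishable from 0 at this precision.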


Glossary 


Hypothesis 
A statement about the value of a population parameter. In case of two 
hypotheses, the statement assumed to be true is called the null 
hypothesis (notation Ho) and the contradictory statement is called the 
alternate hypothesis (notation Ha).


p-value 
The probability that an event will happen purely by chance assuming 
the null hypothesis is true. The smaller the p-value, the stronger the 
evidence is against the null hypothesis. 


Standard Deviation 
A number that is equal to the square root of the variance and measures 
how far data values are from their mean. Notation: s for sample 
standard deviation and σ for population standard deviation.


Decision and Conclusion 


A systematic way to make a decision of whether to reject or not reject the
null hypothesis is to compare the p-value and a preset or preconceived α
(also called a "significance level"). A preset α is the probability of a Type
I error (rejecting the null hypothesis when the null hypothesis is true). It
may or may not be given to you at the beginning of the problem.


When you make a decision to reject or not reject Ho, do as follows:


• If α > p-value, reject Ho. The results of the sample data are
significant. There is sufficient evidence to conclude that Ho is an
incorrect belief and that the alternative hypothesis, Ha, may be
correct.

• If α ≤ p-value, do not reject Ho. The results of the sample data are
not significant. There is not sufficient evidence to conclude that the
alternative hypothesis, Ha, may be correct.

• When you "do not reject Ho", it does not mean that you should believe
that Ho is true. It simply means that the sample data have failed to
provide sufficient evidence to cast serious doubt about the truthfulness
of Ho.


Conclusion: After you make your decision, write a thoughtful conclusion 
about the hypotheses in terms of the given problem. 
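
The comparison of α with the p-value can be written as a tiny Python function. This is a minimal sketch; the name decide is hypothetical, and the function only reports the decision. Writing the conclusion in terms of the problem is still up to you.

```python
def decide(alpha, p_value):
    """Return the decision about Ho from a preset alpha and a p-value."""
    if alpha > p_value:
        return "reject Ho"
    # Covers alpha <= p_value: the sample evidence is not strong enough.
    return "do not reject Ho"

print(decide(alpha=0.05, p_value=0.0187))
print(decide(alpha=0.025, p_value=0.1323))
```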


Glossary 


Hypothesis 
A statement about the value of a population parameter. In case of two 
hypotheses, the statement assumed to be true is called the null 
hypothesis (notation Ho) and the contradictory statement is called the 
alternate hypothesis (notation Ha).


Level of Significance of the Test 
Probability of a Type I error (reject the null hypothesis when it is true). 
Notation: α. In hypothesis testing, the Level of Significance is called
the preconceived α or the preset α.


p-value 
The probability that an event will happen purely by chance assuming 
the null hypothesis is true. The smaller the p-value, the stronger the 
evidence is against the null hypothesis. 


Type 1 Error 
The decision is to reject the Null hypothesis when, in fact, the Null 
hypothesis is true. 


Additional Information 


In a hypothesis test problem, you may see words such as "the level of
significance is 1%." The "1%" is the preconceived or preset α.

The statistician setting up the hypothesis test selects the value of α to
use before collecting the sample data.

If no level of significance is given, the accepted standard is to use
α = 0.05.

When you calculate the p-value and draw the picture, the p-value is
the area in the left tail, the right tail, or split evenly between the two
tails. For this reason, we call the hypothesis test left, right, or
two-tailed.

The alternate hypothesis, Ha, tells you if the test is left, right, or
two-tailed. It is the key to conducting the appropriate test.

Ha never has a symbol that contains an equal sign.

Thinking about the meaning of the p-value: A data analyst (and 
anyone else) should have more confidence that he made the correct 
decision to reject the null hypothesis with a smaller p-value (for 
example, 0.001 as opposed to 0.04) even if using the 0.05 level for 
alpha. Similarly, for a large p-value like 0.4, as opposed to a p-value of 
0.056 (alpha = 0.05 is less than either number), a data analyst should 
have more confidence that she made the correct decision in failing to 
reject the null hypothesis. This makes the data analyst use judgment 
rather than mindlessly applying rules. 


The following examples illustrate a left, right, and two-tailed test. 


Example: 

Ho: μ = 5    Ha: μ < 5

Test of a single population mean. Ha tells you the test is left-tailed. The
picture of the p-value is as follows:


[Figure: normal curve with the p-value shaded in the left tail.]


Example: 

Ho: p ≤ 0.2    Ha: p > 0.2

This is a test of a single population proportion. Ha tells you the test is
right-tailed. The picture of the p-value is as follows:


[Figure: normal curve centered at 0.2 with the p-value shaded in the right tail.]


Example: 

Ho: μ = 50    Ha: μ ≠ 50

This is a test of a single population mean. Ha tells you the test is
two-tailed. The picture of the p-value is as follows.


[Figure: normal curve centered at 50 with half of the p-value, ½(p-value),
shaded in each tail.]
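
The three pictures correspond to three ways of turning a test statistic into a p-value. Here is a minimal Python sketch for a z test statistic; the function name p_value_from_z and the tail labels are hypothetical, and the value −2.08 is used only as an illustration.

```python
from statistics import NormalDist

def p_value_from_z(z, tail):
    """p-value for a z test statistic; tail is 'left', 'right', or 'two'."""
    std_normal = NormalDist()
    if tail == "left":       # Ha uses <
        return std_normal.cdf(z)
    if tail == "right":      # Ha uses >
        return 1 - std_normal.cdf(z)
    if tail == "two":        # Ha uses "not equal": area split between both tails
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right', or 'two'")

left = p_value_from_z(-2.08, "left")
two = p_value_from_z(-2.08, "two")
print(f"left-tailed: {left:.4f}, two-tailed: {two:.4f}")
```

For the same test statistic, the two-tailed p-value is twice the one-tailed value, which is why the two-tailed picture shows half of the p-value in each tail.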


Glossary 


Hypothesis Testing 
Based on sample evidence, a procedure to determine whether the 
hypothesis stated is a reasonable statement and cannot be rejected, or 
is unreasonable and should be rejected. 


p-value 
The probability that an event will happen purely by chance assuming 
the null hypothesis is true. The smaller the p-value, the stronger the 
evidence is against the null hypothesis. 


Summary of the Hypothesis Test 


The hypothesis test itself has an established process. This can be 
summarized as follows: 


1. Determine Ho and Ha. Remember, they are contradictory.

2. Determine the random variable.

3. Determine the distribution for the test.

4. Draw a graph, calculate the test statistic, and use the test statistic to
calculate the p-value. (A z-score and a t-score are examples of test
statistics.)

5. Compare the preconceived α with the p-value, make a decision (reject
or do not reject Ho), and write a clear conclusion using English
sentences.


Notice that in performing the hypothesis test, you use α and not β. β is
needed to help determine the sample size of the data that is used in
calculating the p-value. Remember that the quantity 1 − β is called the
Power of the Test. A high power is desirable. If the power is too low,
statisticians typically increase the sample size while keeping α the same. If
the power is low, the null hypothesis might not be rejected when it should
be.


Glossary 


Hypothesis Testing 
Based on sample evidence, a procedure to determine whether the 
hypothesis stated is a reasonable statement and cannot be rejected, or 
is unreasonable and should be rejected. 


p-value 
The probability that an event will happen purely by chance assuming 
the null hypothesis is true. The smaller the p-value, the stronger the 
evidence is against the null hypothesis. 


Examples 

This module provides examples of Hypothesis Testing of a Single Mean 
and a Single Proportion as a part of the Collaborative Statistics collection 
(col10522) by Barbara Illowsky and Susan Dean. 


Example: 
Exercise: 


Problem: 


Jeffrey, as an eight-year-old, established a mean time of 16.43
seconds for swimming the 25-yard freestyle, with a standard
deviation of 0.8 seconds. His dad, Frank, thought that Jeffrey could
swim the 25-yard freestyle faster by using goggles. Frank bought
Jeffrey a new pair of expensive goggles and timed Jeffrey for 15
25-yard freestyle swims. For the 15 swims, Jeffrey's mean time was 16
seconds. Frank thought that the goggles helped Jeffrey to swim
faster than the 16.43 seconds. Conduct a hypothesis test using a
preset α = 0.05. Assume that the swim times for the 25-yard
freestyle are normal.


Solution: 
Set up the Hypothesis Test: 


Since the problem is about a mean, this is a test of a single 
population mean. 


Ho: μ = 16.43    Ha: μ < 16.43


For Jeffrey to swim faster, his time will be less than 16.43 seconds. 
The "<" tells you this is left-tailed. 


Determine the distribution needed: 


Random variable: X̄ = the mean time to swim the 25-yard freestyle.


Distribution for the test: X̄ is normal (population standard
deviation is known: σ = 0.8)


$\bar{X} \sim N\left(\mu, \frac{\sigma_X}{\sqrt{n}}\right)$ Therefore, $\bar{X} \sim N\left(16.43, \frac{0.8}{\sqrt{15}}\right)$.


μ = 16.43 comes from Ho and not the data. σ = 0.8, and n = 15.
Calculate the p-value using the normal distribution for a mean:
p-value = P(x̄ < 16) = 0.0187 where the sample mean in the
problem is given as 16.


p-value = 0.0187 (This is called the actual level of significance.)
The p-value is the area to the left of the sample mean, which is given as 16.


Graph: [Figure: normal curve centered at μ = 16.43 with the p-value shaded
to the left of x̄ = 16.]


μ = 16.43 comes from Ho. Our assumption is μ = 16.43.
Interpretation of the p-value: If Ho is true, there is a 0.0187
probability (1.87%) that Jeffrey's mean time to swim the 25-yard
freestyle is 16 seconds or less. Because a 1.87% chance is small, the
mean time of 16 seconds or less is unlikely to have happened
randomly. It is a rare event.


Compare α and the p-value:
α = 0.05 p-value = 0.0187 α > p-value


Make a decision: Since α > p-value, reject Ho.


This means that you reject μ = 16.43. In other words, you do not
think Jeffrey swims the 25-yard freestyle in 16.43 seconds but faster
with the new goggles.


Conclusion: At the 5% significance level, we conclude that Jeffrey 
swims faster using the new goggles. The sample data show there is 
sufficient evidence that Jeffrey's mean time to swim the 25-yard 
freestyle is less than 16.43 seconds. 
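

For readers working without a calculator, the same left-tailed test can be sketched in Python. This is a minimal sketch that uses only the numbers given in the problem; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 16.43    # mean time under the null hypothesis, in seconds
sigma = 0.8    # known population standard deviation, in seconds
n = 15         # number of timed swims
x_bar = 16     # sample mean with the new goggles, in seconds
alpha = 0.05   # preset level of significance

se = sigma / sqrt(n)              # standard deviation of the sample mean
z = (x_bar - mu0) / se            # test statistic
p_value = NormalDist().cdf(z)     # left-tail area, since Ha is mu < 16.43
decision = "reject Ho" if alpha > p_value else "do not reject Ho"
print(f"z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```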


The p-value can easily be calculated using the TI-83+ and the TI-84 
calculators: 


Press STAT and arrow over to TESTS. Press 1:Z-Test. Arrow over
to Stats and press ENTER. Arrow down and enter 16.43 for μ0 (null
hypothesis), .8 for σ, 16 for the sample mean, and 15 for n. Arrow
down to μ: (alternate hypothesis) and arrow over to <μ0. Press
ENTER. Arrow down to Calculate and press ENTER. The
calculator not only calculates the p-value (p = 0.0187) but it also
calculates the test statistic (z-score) for the sample mean. μ < 16.43
is the alternate hypothesis. Do this set of instructions again except
arrow to Draw (instead of Calculate). Press ENTER. A shaded
graph appears with z = −2.08 (test statistic) and p = 0.0187
(p-value). Make sure when you use Draw that no other equations are
highlighted in Y= and the plots are turned off.


When the calculator does a Z-Test, the Z-Test function finds the
p-value by doing a normal probability calculation using the Central
Limit Theorem:


P(x̄ < 16) = 2nd DISTR normcdf(−10^99, 16, 16.43, 0.8/√15).


The Type I and Type II errors for this problem are as follows:


The Type I error is to conclude that Jeffrey swims the 25-yard
freestyle, on average, in less than 16.43 seconds when, in fact, he
actually swims the 25-yard freestyle, on average, in 16.43 seconds.
(Reject the null hypothesis when the null hypothesis is true.)


The Type II error is that there is not evidence to conclude that Jeffrey 
swims the 25-yard free-style, on average, in less than 16.43 seconds 
when, in fact, he actually does swim the 25-yard free-style, on 
average, in less than 16.43 seconds. (Do not reject the null hypothesis 
when the null hypothesis is false.) 


Historical Note: The traditional way to compare the two probabilities, α
and the p-value, is to compare the critical value (z-score from α) to the test
statistic (z-score from data). The calculated test statistic for the p-value is
-2.08. (From the Central Limit Theorem, the test statistic formula is

$z = \frac{\bar{x} - \mu_X}{\sigma_X / \sqrt{n}}$. For this problem, $\bar{x}$ = 16, $\mu_X$ = 16.43 from the null
hypothesis, $\sigma_X$ = 0.8, and n = 15.) You can find the critical value for
α = 0.05 in the normal table (see 15.Tables in the Table of Contents). The
z-score for an area to the left equal to 0.05 is midway between -1.65 and
-1.64 (0.05 is midway between 0.0505 and 0.0495). The z-score is -1.645.
Since -1.645 > -2.08 (which demonstrates that α > p-value), reject $H_0$.
Traditionally, the decision to reject or not reject was done in this way.
Today, comparing the two probabilities α and the p-value is very common.
For this problem, the p-value, 0.0187, is considerably smaller than α, 0.05.
You can be confident about your decision to reject. The graph shows α, the
p-value, the test statistic, and the critical value.
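
The two ways of deciding described in this note can be placed side by side. The following is only a sketch, assuming Python's standard-library `statistics.NormalDist`; it uses the rounded test statistic -2.08 from the note, and the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
z_test = -2.08                  # test statistic from the example (rounded)
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Critical value: the z-score with area alpha to its left.
z_critical = standard_normal.inv_cdf(alpha)

# p-value: the area to the left of the test statistic.
p_value = standard_normal.cdf(z_test)

# Traditional method: compare z-scores. Modern method: compare areas.
reject_by_critical_value = z_test < z_critical
reject_by_p_value = p_value < alpha

print(f"critical value = {z_critical:.3f}, p-value = {p_value:.4f}")
print(reject_by_critical_value, reject_by_p_value)
```

For a left-tailed test the two comparisons always agree, because a smaller z-score corresponds to a smaller left-tail area.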


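[Figure: normal curve with the left tail shaded; p-value = 0.0187.
Horizontal axis marks: -2.08 (test statistic), -1.645 (critical value), 0.]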


Example: 
Exercise: 


Problem: 


A college football coach thought that his players could bench press a 
mean weight of 275 pounds. It is known that the standard deviation 
is 55 pounds. Three of his players thought that the mean weight was 
more than that amount. They asked 30 of their teammates for their 
estimated maximum lift on the bench press exercise. The data ranged 
from 205 pounds to 385 pounds. The actual different weights were 
(frequencies are in parentheses) 205(3) 215(3) 225(1) 241(2) 252(2) 
265(2) 275(2) 313(2) 316(5) 338(2) 341(1) 345(2) 368(2) 385(1). 
(Source: data from Reuben Davis, Kraig Evans, and Scott
Gunderson.)


Conduct a hypothesis test using a 2.5% level of significance to 
determine if the bench press mean is more than 275 pounds. 


Solution: 
Set up the Hypothesis Test: 


Since the problem is about a mean weight, this is a test of a single 
population mean. 


$H_0$: μ = 275   $H_a$: μ > 275   This is a right-tailed test.
Calculating the distribution needed: 


Random variable: $\bar{X}$ = the mean weight, in pounds, lifted by the
football players.


Distribution for the test: It is normal because σ is known.


$\bar{x}$ = 286.2 pounds (from the data).


σ = 55 pounds (Always use σ if you know it.) We assume μ = 275
pounds unless our data shows us otherwise.


Calculate the p-value using the normal distribution for a mean and 
using the sample mean as input (see the calculator instructions below 
for using the data as input): 


p-value = P($\bar{x}$ > 286.2) = 0.1323.


Interpretation of the p-value: If $H_0$ is true, then there is a 0.1323
probability (13.23%) that the football players can lift a mean weight
of 286.2 pounds or more. Because a 13.23% chance is large enough, a
mean weight lift of 286.2 pounds or more is not a rare event.


[Figure: normal curve with the right tail shaded; p-value = 0.1323.
Horizontal axis ($\bar{x}$) marks: 275 (μ), 286.2 ($\bar{x}$).]


Compare α and the p-value:
α = 0.025   p-value = 0.1323
Make a decision: Since α < p-value, do not reject $H_0$.


Conclusion: At the 2.5% level of significance, from the sample data, 
there is not sufficient evidence to conclude that the true mean weight 
lifted is more than 275 pounds. 


The p-value can easily be calculated using the TI-83+ and the TI-84 
calculators: 


Put the data and frequencies into lists. Press STAT and arrow over to 
TESTS. Press 1:Z-Test. Arrow over to Data and press ENTER. 


Arrow down and enter 275 for $\mu_0$, 55 for σ, the name of the list where
you put the data, and the name of the list where you put the
frequencies. Arrow down to μ: and arrow over to > $\mu_0$. Press
ENTER. Arrow down to Calculate and press ENTER. The
calculator not only calculates the p-value (p = 0.1331, a little
different from the above calculation, in which we used the sample mean
rounded to one decimal place instead of the data) but it also calculates
the test statistic (z-score) for the sample mean, the sample mean, and
the sample standard deviation. μ > 275 is the alternate hypothesis.
Do this set of instructions again except arrow to Draw (instead of
Calculate). Press ENTER. A shaded graph appears with z = 1.112
(test statistic) and p = 0.1331 (p-value). Make sure when you use
Draw that no other equations are highlighted in Y= and the plots are
turned off.
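
The effect of using the raw data instead of the rounded sample mean can be checked with a short sketch. It assumes Python's standard-library `statistics.NormalDist`; the dictionary below simply restates the weights and frequencies given in the problem, and the names are ours.

```python
from math import sqrt
from statistics import NormalDist

# Weight in pounds -> frequency, as listed in the problem statement.
frequencies = {205: 3, 215: 3, 225: 1, 241: 2, 252: 2, 265: 2, 275: 2,
               313: 2, 316: 5, 338: 2, 341: 1, 345: 2, 368: 2, 385: 1}

# Expand the frequency table into the 30 individual observations.
data = [weight for weight, count in frequencies.items() for _ in range(count)]

mu_0 = 275    # hypothesized mean (null hypothesis)
sigma = 55    # known population standard deviation
n = len(data)
x_bar = sum(data) / n

z = (x_bar - mu_0) / (sigma / sqrt(n))   # test statistic
p_value = 1 - NormalDist().cdf(z)        # right-tail area

print(f"n = {n}, sample mean = {x_bar:.4f}")
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

Working from the unrounded sample mean gives the p = 0.1331 reported by the calculator, while the rounded mean 286.2 gives the 0.1323 used earlier in the solution.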


Example: 
Exercise: 


Problem: 


Statistics students believe that the mean score on the first statistics test 
is 65. A statistics instructor thinks the mean score is higher than 65. 
He samples ten statistics students and obtains the scores 65 65 70 67 
66 63 63 68 72 71. He performs a hypothesis test using a 5% level of 
significance. The data are from a normal distribution. 


Solution: 
Set up the Hypothesis Test: 


A 5% level of significance means that α = 0.05. This is a test of a
single population mean.


$H_0$: μ = 65   $H_a$: μ > 65


Since the instructor thinks the average score is higher, use a ">". The 
">" means the test is right-tailed. 


Determine the distribution needed: 

Random variable: $\bar{X}$ = average score on the first statistics test.
Distribution for the test: If you read the problem carefully, you will
notice that there is no population standard deviation given. You are
only given n = 10 sample data values. Notice also that the data come
from a normal distribution. This means that the distribution for the
test is a Student's-t.


Use $t_{df}$. Therefore, the distribution for the test is $t_9$ where n = 10 and
df = 10 - 1 = 9.


Calculate the p-value using the Student's-t distribution: 


p-value = P($\bar{x}$ > 67) = 0.0396 where the sample mean and sample
standard deviation are calculated as 67 and 3.1972 from the data.


Interpretation of the p-value: If the null hypothesis is true, then
there is a 0.0396 probability (3.96%) that the sample mean is 67 or
more.


[Figure: Student's-t curve with the right tail shaded; p-value = 0.0396.
Horizontal axis ($\bar{x}$) marks: 65, 67.]


Compare α and the p-value:


Since α = 0.05 and p-value = 0.0396, α > p-value.


Make a decision: Since α > p-value, reject $H_0$.


This means you reject μ = 65. In other words, you believe the
average test score is more than 65.


Conclusion: At a 5% level of significance, the sample data show 
sufficient evidence that the mean (average) test score is more than 65, 
just as the math instructor thinks. 


The p-value can easily be calculated using the TI-83+ and the TI-84 
calculators: 


Put the data into a list. Press STAT and arrow over to TESTS. Press
2:T-Test. Arrow over to Data and press ENTER. Arrow down and
enter 65 for $\mu_0$, the name of the list where you put the data, and 1 for
Freq:. Arrow down to μ: and arrow over to > $\mu_0$. Press ENTER.
Arrow down to Calculate and press ENTER. The calculator not
only calculates the p-value (p = 0.0396) but it also calculates the test
statistic (t-score) for the sample mean, the sample mean, and the
sample standard deviation. μ > 65 is the alternate hypothesis. Do this
set of instructions again except arrow to Draw (instead of
Calculate). Press ENTER. A shaded graph appears with
t = 1.9781 (test statistic) and p = 0.0396 (p-value). Make sure when
you use Draw that no other equations are highlighted in Y= and the
plots are turned off.
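
The T-Test output can be approximated without a calculator. The sketch below uses only the Python standard library, which has no Student's-t distribution, so the right-tail area is obtained by numerically integrating the t density with Simpson's rule. That integration and the helper names `t_pdf` and `t_upper_tail` are our own assumptions for illustration, not the method the calculator uses.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of the Student's-t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: 0.5 minus the Simpson's-rule area over [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

scores = [65, 65, 70, 67, 66, 63, 63, 68, 72, 71]
mu_0 = 65                  # hypothesized mean (null hypothesis)
n = len(scores)
x_bar = mean(scores)
s = stdev(scores)          # sample standard deviation (divides by n - 1)

t_stat = (x_bar - mu_0) / (s / sqrt(n))
p_value = t_upper_tail(t_stat, n - 1)   # right-tailed test, df = 9

print(f"mean = {x_bar}, s = {s:.4f}, t = {t_stat:.4f}, p = {p_value:.4f}")
```

The symmetry of the t curve is what allows the tail area to be written as 0.5 minus the area between 0 and the test statistic.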


Example: 
Exercise: 


Problem: 


Joon believes that 50% of first-time brides in the United States are 
younger than their grooms. She performs a hypothesis test to 
determine if the percentage is the same or different from 50%. Joon 
samples 100 first-time brides and 53 reply that they are younger than 
their grooms. For the hypothesis test, she uses a 1% level of 
significance. 


Solution: 
Set up the Hypothesis Test: 


The 1% level of significance means that α = 0.01. This is a test of a
single population proportion.


$H_0$: p = 0.50   $H_a$: p ≠ 0.50


The words "is the same or different from" tell you this is a two- 
tailed test. 


Calculate the distribution needed: 


Random variable: P' = the percent of first-time brides who are
younger than their grooms.


Distribution for the test: The problem contains no mention of a
mean. The information is given in terms of percentages. Use the
distribution for P', the estimated proportion.


$P' \sim N\left(p, \sqrt{\frac{p \cdot q}{n}}\right)$. Therefore, $P' \sim N\left(0.5, \sqrt{\frac{0.5 \cdot 0.5}{100}}\right)$ where
p = 0.50, q = 1 - p = 0.50, and n = 100.
Calculate the p-value using the normal distribution for proportions: 


p-value = P(p' < 0.47 or p' > 0.53) = 0.5485


where x = 53, p' = x/n = 53/100 = 0.53.


Interpretation of the p-value: If the null hypothesis is true, there is
0.5485 probability (54.85%) that the sample (estimated) proportion p'
is 0.53 or more OR 0.47 or less (see the graph below).


[Figure: normal curve with both tails shaded; each tail has area
½(p-value) = 0.27425. Horizontal axis marks: 0.47, 0.50, 0.53.]


μ = p = 0.50 comes from $H_0$, the null hypothesis.


p' = 0.53. Since the curve is symmetrical and the test is two-tailed,
the p' for the left tail is equal to 0.50 - 0.03 = 0.47 where
μ = p = 0.50. (0.03 is the difference between 0.53 and 0.50.)


Compare α and the p-value:
Since α = 0.01 and p-value = 0.5485, α < p-value.
Make a decision: Since α < p-value, you cannot reject $H_0$.


Conclusion: At the 1% level of significance, the sample data do not 
show sufficient evidence that the percentage of first-time brides that 
are younger than their grooms is different from 50%. 


The p-value can easily be calculated using the TI-83+ and the TI-84 
calculators: 


Press STAT and arrow over to TESTS. Press 5:1-PropZTest.
Enter .5 for $p_0$, 53 for x and 100 for n. Arrow down to Prop and
arrow to not equals $p_0$. Press ENTER. Arrow down to
Calculate and press ENTER. The calculator calculates the p-value
(p = 0.5485) and the test statistic (z-score). Prop not equals .5
is the alternate hypothesis. Do this set of instructions again except
arrow to Draw (instead of Calculate). Press ENTER. A shaded
graph appears with z = 0.6 (test statistic) and p = 0.5485 (p-value).
Make sure when you use Draw that no other equations are
highlighted in Y= and the plots are turned off.
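
As a rough cross-check of the 1-PropZTest output, the two-tailed p-value can be computed directly. This is a sketch assuming Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p_0 = 0.50   # hypothesized proportion (null hypothesis)
x = 53       # brides in the sample younger than their grooms
n = 100      # sample size

p_prime = x / n                            # estimated proportion
standard_error = sqrt(p_0 * (1 - p_0) / n)
z = (p_prime - p_0) / standard_error       # test statistic

# Two-tailed test: double the area beyond |z| in one tail.
one_tail = 1 - NormalDist().cdf(abs(z))
p_value = 2 * one_tail

print(f"z = {z:.1f}, one tail = {one_tail:.5f}, p-value = {p_value:.4f}")
```

Each tail carries half of the p-value, which matches the two shaded regions of area 0.27425 described in the solution.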


The Type I and Type II errors are as follows:


The Type I error is to conclude that the proportion of first-time brides 
that are younger than their grooms is different from 50% when, in 
fact, the proportion is actually 50%. (Reject the null hypothesis when 
the null hypothesis is true). 


The Type II error is there is not enough evidence to conclude that the 
proportion of first time brides that are younger than their grooms 
differs from 50% when, in fact, the proportion does differ from 50%. 
(Do not reject the null hypothesis when the null hypothesis is false.) 


Example: 
Exercise: 


Problem: 


Suppose a consumer group suspects that the proportion of households 
that have three cell phones is 30%. A cell phone company has reason
to believe that the proportion is not 30%. Before they start a big
advertising campaign, they conduct a hypothesis test. Their marketing 
people survey 150 households with the result that 43 of the 
households have three cell phones. 


Solution: 


Set up the Hypothesis Test: 


$H_0$: p = 0.30   $H_a$: p ≠ 0.30
Determine the distribution needed:


The random variable is P' = proportion of households that have
three cell phones.


The distribution for the hypothesis test is
$P' \sim N\left(0.30, \sqrt{\frac{(0.30) \cdot (0.70)}{150}}\right)$


Exercise: 


Problem: 
The value that helps determine the p-value is p'. Calculate p'.


Solution:


p' = x/n where x is the number of successes and n is the
total number in the sample.


x = 43, n = 150

p' = 43/150
Exercise: 


Problem: What is a success for this problem? 


Solution: 


A success is having three cell phones in a household. 


Exercise: 


Problem: What is the level of significance? 


Solution: 


The level of significance is the preset α. Since α is not given,
assume that α = 0.05.


Draw the graph for this problem. Draw the horizontal axis. Label and 
shade appropriately. 
Exercise: 


Problem: Calculate the p-value. 


Solution: 


p-value = 0.7216 
Exercise: 


Problem: 


Make a decision. __________ (Reject/Do not reject) $H_0$
because __________.


Solution: 


Assuming that α = 0.05, α < p-value. The decision is do not
reject $H_0$ because there is not sufficient evidence to conclude
that the proportion of households that have three cell phones is
not 30%.
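
The steps of this example (estimate p', standardize, find the tail area) repeat for every single-proportion test, so they can be wrapped in one small function. The function name `one_prop_z_test` and its `tail` parameter are hypothetical, introduced here only for illustration; the sketch assumes Python's standard-library `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(x, n, p_0, tail="two"):
    """Return (p_prime, z, p_value) for a test of a single proportion.

    tail is "left", "right", or "two" (hypothetical parameter for this sketch).
    """
    p_prime = x / n
    z = (p_prime - p_0) / sqrt(p_0 * (1 - p_0) / n)
    left_area = NormalDist().cdf(z)
    if tail == "left":
        p_value = left_area
    elif tail == "right":
        p_value = 1 - left_area
    else:
        p_value = 2 * min(left_area, 1 - left_area)
    return p_prime, z, p_value

# Cell phone example: 43 of 150 households, H0: p = 0.30, two-tailed.
p_prime, z, p_value = one_prop_z_test(43, 150, 0.30, tail="two")
alpha = 0.05   # assumed, since the problem does not give a level
print(f"p' = {p_prime:.4f}, z = {z:.4f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```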


The next example is a poem written by a statistics student named Nicole 
Hart. The solution to the problem follows the poem. Notice that the 
hypothesis test is for a single population proportion. This means that the 
null and alternate hypotheses use the parameter p. The distribution for the 
test is normal. The estimated proportion p' is the proportion of fleas killed
to the total fleas found on Fido. This is sample information. The problem
gives a preconceived α = 0.01, for comparison, and a 95% confidence
interval computation. The poem is clever and humorous, so please enjoy it! 


Note: Hypothesis testing problems consist of multiple steps. To help you do
the problems, solution sheets are provided for your use. Look in the Table 
of Contents Appendix for the topic "Solution Sheets." If you like, use 
copies of the appropriate solution sheet for homework problems. 


Example: 
Exercise: 


Problem: 


My dog has so many fleas,
They do not come off with ease.
As for shampoo, I have tried many types
Even one called Bubble Hype,
Which only killed 25% of the fleas,
Unfortunately I was not pleased.
I've used all kinds of soap,
Until I had given up hope
Until one day I saw
An ad that put me in awe.
A shampoo used for dogs
Called GOOD ENOUGH to Clean a Hog
Guaranteed to kill more fleas.
I gave Fido a bath
And after doing the math
His number of fleas
Started dropping by 3's!
Before his shampoo
I counted 42.
At the end of his bath,
I redid the math
And the new shampoo had killed 17 fleas.
So now I was pleased.
Now it is time for you to have some fun
With the level of significance being .01,
You must help me figure out
Use the new shampoo or go without?


Solution: 
Set up the Hypothesis Test: 
$H_0$: p = 0.25   $H_a$: p > 0.25


Determine the distribution needed: 


In words, CLEARLY state what your random variable X or P’ 
represents. 


P’ = The proportion of fleas that are killed by the new shampoo 


State the distribution to use for the test. 
Normal: $N\left(0.25, \sqrt{\frac{(0.25)(1 - 0.25)}{42}}\right)$


Test Statistic: z = 2.3163 
Calculate the p-value using the normal distribution for proportions: 
p-value =0.0103 


In 1 — 2 complete sentences, explain what the p-value means for this 
problem. 


If the null hypothesis is true (the proportion is 0.25), then there is a
0.0103 probability that the sample (estimated) proportion is 0.4048
(17/42) or more.


Use the previous information to sketch a picture of this situation. 
CLEARLY, label and scale the horizontal axis and shade the region(s) 
corresponding to the p-value. 


[Figure: normal curve with the right tail shaded. Horizontal axis (p')
marks: 0.25 and 17/42 = 0.4048; test statistic for 17/42: 2.3163.]


Compare α and the p-value:


Indicate the correct decision (“reject” or “do not reject” the null
hypothesis), the reason for it, and write an appropriate conclusion,
using COMPLETE SENTENCES:


alpha    decision               reason for decision
0.01     Do not reject $H_0$    α < p-value


Conclusion: At the 1% level of significance, the sample data do not 
show sufficient evidence that the percentage of fleas that are killed by 
the new shampoo is more than 25%. 


Construct a 95% Confidence Interval for the true mean or proportion. 
Include a sketch of the graph of the situation. Label the point estimate 
and the lower and upper bounds of the Confidence Interval. 


[Figure: confidence interval on the p' axis. Lower bound 0.26, point
estimate 17/42, upper bound 0.55.]


Confidence Interval: (0.26, 0.55) We are 95% confident that the true 
population proportion p of fleas that are killed by the new shampoo is 
between 26% and 55%. 


Note: This test result is not very definitive since the p-value is very 
close to alpha. In reality, one would probably do more tests by giving 
the dog another bath after the fleas have had a chance to return. 
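
The test and the confidence interval for the flea problem use two different standard errors: the test uses the hypothesized p = 0.25, while the interval uses the estimate p' = 17/42. The sketch below makes that distinction explicit. It assumes Python's standard-library `statistics.NormalDist` and a normal-approximation interval; the variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

x, n = 17, 42    # fleas killed, fleas counted before the bath
p_0 = 0.25       # kill rate of the old shampoo (null hypothesis)
alpha = 0.01

p_prime = x / n

# Hypothesis test (right-tailed): standard error built from p_0.
z = (p_prime - p_0) / sqrt(p_0 * (1 - p_0) / n)
p_value = 1 - NormalDist().cdf(z)
reject = p_value < alpha

# 95% confidence interval: standard error built from p_prime.
z_star = NormalDist().inv_cdf(0.975)
margin = z_star * sqrt(p_prime * (1 - p_prime) / n)
ci_low, ci_high = p_prime - margin, p_prime + margin

print(f"z = {z:.4f}, p-value = {p_value:.4f}, reject H0: {reject}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

The p-value lands just above 0.01, which is why the note describes the result as not very definitive.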


Glossary 


Central Limit Theorem
Given a random variable (RV) with known mean μ and known
standard deviation σ, we are sampling with size n and we are
interested in two new RVs: the sample mean, $\bar{X}$, and the sample sum,
$\Sigma X$. If the size n of the sample is sufficiently large, then
$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$ and $\Sigma X \sim N\left(n\mu, \sqrt{n}\,\sigma\right)$. If the size n of the sample is
sufficiently large, then the distribution of the sample means and the
distribution of the sample sums will approximate a normal distribution
regardless of the shape of the population. The mean of the sample
means will equal the population mean and the mean of the sample
sums will equal n times the population mean. The standard deviation
of the distribution of the sample means, $\frac{\sigma}{\sqrt{n}}$, is called the standard
error of the mean.
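
A small simulation can make this definition concrete. The sketch below is an illustration only: it assumes a skewed exponential population with mean 1 and standard deviation 1 (our choice, not from the text), draws many samples of size n = 40, and compares the spread of the sample means with σ/√n.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)           # fixed seed so the run is repeatable

mu, sigma = 1.0, 1.0     # exponential population with rate 1 (assumed)
n = 40                   # size of each sample
trials = 5000            # number of samples drawn

# Draw many samples and record each sample mean.
sample_means = []
for _ in range(trials):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
standard_error = sigma / sqrt(n)   # value predicted by the theorem

print(f"mean of sample means = {mean_of_means:.3f} (population mean {mu})")
print(f"sd of sample means   = {sd_of_means:.3f} (sigma/sqrt(n) = {standard_error:.3f})")
```

Even though the population is far from normal, the sample means cluster around μ with a spread close to the standard error of the mean.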


Standard Deviation 
A number that is equal to the square root of the variance and measures 
how far data values are from their mean. Notation: s for sample
standard deviation and σ for population standard deviation.


Summary of Formulas 


$H_0$ and $H_a$ are contradictory.


If $H_0$ has:     equal (=)                        greater than or equal to (≥)    less than or equal to (≤)
then $H_a$ has:   not equal (≠) or greater         less than (<)                   greater than (>)
                  than (>) or less than (<)


If α < p-value, then do not reject $H_0$.
If α > p-value, then reject $H_0$.


α is preconceived. Its value is set before the hypothesis test starts. The
p-value is calculated from the data.


α = probability of a Type I error = P(Type I error) = probability of rejecting
the null hypothesis when the null hypothesis is true.


β = probability of a Type II error = P(Type II error) = probability of not
rejecting the null hypothesis when the null hypothesis is false.


If there is no given preconceived α, then use α = 0.05.
Types of Hypothesis Tests 


• Single population mean, known population variance (or standard
deviation): Normal test.

• Single population mean, unknown population variance (or standard
deviation): Student's-t test.

• Single population proportion: Normal test.


Practice 1: Single Mean, Known Population Standard Deviation 

This module provides a practice of Hypothesis Testing of Single Mean and 
Single Proportion as a part of Collaborative Statistics collection (col10522) 
by Barbara Illowsky and Susan Dean. 


Student Learning Outcomes 


• The student will conduct a hypothesis test of a single mean with
known population standard deviation.


Given 


Suppose that a recent article stated that the mean time spent in jail by a 
first-time convicted burglar is 2.5 years. A study was then done to see if the 
mean time has increased in the new century. A random sample of 26
first-time convicted burglars in a recent year was picked. The mean length of
time in jail from the survey was 3 years with a standard deviation of 1.8 
years. Suppose that it is somehow known that the population standard 
deviation is 1.5. Conduct a hypothesis test to determine if the mean length 
of jail time has increased. The distribution of the population is normal. 


Hypothesis Testing: Single Mean 


Exercise: 


Problem: Is this a test of means or proportions? 
Solution: 


Means 


Exercise: 


Problem: State the null and alternative hypotheses. 


a. $H_0$:


b. $H_a$:


Solution:


a. $H_0$: μ = 2.5 (or $H_0$: μ ≤ 2.5)
b. $H_a$: μ > 2.5


Exercise: 
Problem: 


Is this a right-tailed, left-tailed, or two-tailed test? How do you know? 


Solution: 
right-tailed 


Exercise: 


Problem: What symbol represents the Random Variable for this test? 


Solution: 


$\bar{X}$


Exercise: 


Problem: In words, define the Random Variable for this test. 


Solution: 


The mean time spent in jail for 26 first time convicted burglars 
Exercise: 


Problem: 


Is the population standard deviation known and, if so, what is it? 


Solution: 
Yes, 1.5 
Exercise: 
Problem: Calculate the following:
a. $\bar{x}$ =
b. σ =
c. $s_x$ =
d. n =
Solution:
a. 3
b. 1.5
c. 1.8
d. 26


Exercise: 


Problem: 


Since both σ and $s_x$ are given, which should be used? In 1-2 complete
sentences, explain why.


Solution: 


σ


Exercise: 


Problem: State the distribution to use for the hypothesis test. 


Solution: 


$\bar{X} \sim N\left(2.5, \frac{1.5}{\sqrt{26}}\right)$


Exercise: 
Problem: 
Sketch a graph of the situation. Label the horizontal axis. Mark the 


hypothesized mean and the sample mean $\bar{x}$. Shade the area
corresponding to the p-value.


[Figure: blank horizontal axis labeled $\bar{x}$, for the sketch.]


Exercise: 


Problem: Find the p-value. 
Solution: 
0.0446 
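
One way to confirm this p-value is to compute the right-tail area directly from the distribution stated above. This is a sketch assuming Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 2.5     # hypothesized mean jail time in years
sigma = 1.5    # known population standard deviation
n = 26         # sample size
x_bar = 3      # sample mean from the survey

# Under H0 the sample mean follows N(2.5, 1.5 / sqrt(26)).
sampling_dist = NormalDist(mu=mu_0, sigma=sigma / sqrt(n))

z = (x_bar - mu_0) / (sigma / sqrt(n))
p_value = 1 - sampling_dist.cdf(x_bar)   # right-tailed test

alpha = 0.05
print(f"z = {z:.4f}, p-value = {p_value:.4f}, reject H0: {p_value < alpha}")
```

The sample standard deviation 1.8 is never used, because the population standard deviation is known.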
Exercise: 
Problem: At a pre-conceived α = 0.05, what is your:


a. Decision:
b. Reason for the decision:
c. Conclusion (write out in a complete sentence):


Solution:


a. Reject the null hypothesis


Discussion Questions 


Exercise: 


Problem: 


Does it appear that the mean jail time spent for first time convicted 
burglars has increased? Why or why not? 


Practice 2: Single Mean, Unknown Population Standard Deviation 



Student Learning Outcomes 


• The student will conduct a hypothesis test of a single mean with
unknown population standard deviation.


Given 
A random survey of 75 death row inmates revealed that the mean length of 
time on death row is 17.4 years with a standard deviation of 6.3 years. 


Conduct a hypothesis test to determine if the population mean time on death 
row could likely be 15 years. 


Hypothesis Testing: Single Mean 


Exercise: 


Problem: Is this a test of means or proportions? 
Solution: 
averages 


Exercise: 


Problem: State the null and alternative hypotheses. 


a. $H_0$:
b. $H_a$:


Solution:


a. $H_0$: μ = 15
b. $H_a$: μ ≠ 15


Exercise: 
Problem: 


Is this a right-tailed, left-tailed, or two-tailed test? How do you know? 


Solution: 
two-tailed 


Exercise: 


Problem: What symbol represents the Random Variable for this test? 


Solution: 


$\bar{X}$


Exercise: 


Problem: In words, define the Random Variable for this test. 


Solution: 


the mean time spent on death row for the 75 inmates 
Exercise: 


Problem: 
Is the population standard deviation known and, if so, what is it? 
Solution: 


No 


Exercise: 


Problem: Calculate the following:
a. $\bar{x}$ =
b. 6.3 =
c. n =
Solution:
a. 17.4
b. s
c. 75


Exercise: 
Problem: 
Which test should be used? In 1 -2 complete sentences, explain why. 
Solution: 
t-test


Exercise: 


Problem: State the distribution to use for the hypothesis test. 
Solution: 


$t_{74}$
Exercise: 
Problem: 
Sketch a graph of the situation. Label the horizontal axis. Mark the 


hypothesized mean and the sample mean, $\bar{x}$. Shade the area
corresponding to the p-value.


[Figure: blank horizontal axis labeled $\bar{x}$, for the sketch.]
Exercise: 


Problem: Find the p-value. 


Solution: 
0.0015 
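
Because this test is two-tailed, the p-value is twice the area beyond the test statistic. The sketch below uses only the Python standard library, which has no Student's-t distribution, so the tail area comes from a Simpson's-rule integration of the t density. The helpers `t_pdf` and `t_upper_tail` are our own illustrative assumptions, not a library API.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of the Student's-t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: 0.5 minus the Simpson's-rule area over [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

x_bar = 17.4   # sample mean time on death row in years
mu_0 = 15      # hypothesized mean
s = 6.3        # sample standard deviation
n = 75         # sample size
df = n - 1

t_stat = (x_bar - mu_0) / (s / sqrt(n))
p_value = 2 * t_upper_tail(abs(t_stat), df)   # two-tailed

print(f"t = {t_stat:.3f}, df = {df}, p-value = {p_value:.4f}")
```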
Exercise: 

Problem: At a pre-conceived α = 0.05, what is your:
a. Decision:
b. Reason for the decision:
c. Conclusion (write out in a complete sentence):

Solution:


a. Reject the null hypothesis


Discussion Question 


Does it appear that the mean time on death row could be 15 years? Why or 
why not? 


Practice 3: Single Proportion 



Student Learning Outcomes 


• The student will conduct a hypothesis test of a single population
proportion.


Given 


The National Institute of Mental Health published an article stating that in 
any one-year period, approximately 9.5 percent of American adults suffer 
from depression or a depressive illness. 
(http://www.nimh.nih.gov/publicat/depression.cfm) Suppose that in a 
survey of 100 people in a certain town, seven of them suffered from 
depression or a depressive illness. Conduct a hypothesis test to determine if 
the true proportion of people in that town suffering from depression or a 
depressive illness is lower than the percent in the general adult American 
population. 


Hypothesis Testing: Single Proportion 


Exercise: 


Problem: Is this a test of means or proportions? 


Solution: 


Proportions 


Exercise: 


Problem: State the null and alternative hypotheses. 


a. $H_0$:
b. $H_a$:
Solution:


a. $H_0$: p = 0.095
b. $H_a$: p < 0.095


Exercise: 
Problem: 
Is this a right-tailed, left-tailed, or two-tailed test? How do you know? 
Solution: 
left-tailed 


Exercise: 


Problem: What symbol represents the Random Variable for this test? 


Solution: 
P'


Exercise: 


Problem: In words, define the Random Variable for this test. 


Solution: 


the proportion of people in that town surveyed suffering from 
depression or a depressive illness 


Exercise: 


Problem: Calculate the following:


a. x =
b. n =
c. p' =


Solution:
a. 7
b. 100
c. 0.07


Exercise: 
Problem: 


Calculate $\sigma_{p'}$. Make sure to show how you set up the formula.


Solution: 


0.0293 


Exercise: 


Problem: State the distribution to use for the hypothesis test. 


Solution: 


Normal 
Exercise: 
Problem: 
Sketch a graph of the situation. Label the horizontal axis. Mark the 


hypothesized mean and the sample proportion, p-hat. Shade the area 
corresponding to the p-value. 


[Figure: blank horizontal axis labeled P', for the sketch.]
Exercise: 


Problem: Find the p-value 
Solution: 
0.1969 
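
The standard deviation of P' and the left-tail p-value from the two preceding exercises can be reproduced as follows. This is a sketch assuming Python's standard-library `statistics.NormalDist`; the variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

p_0 = 0.095   # proportion reported for American adults (null hypothesis)
x = 7         # people in the town sample with depression
n = 100       # sample size

p_prime = x / n

# Standard deviation of P' under H0: sqrt(p * q / n) with q = 1 - p.
sigma_p = sqrt(p_0 * (1 - p_0) / n)

z = (p_prime - p_0) / sigma_p
p_value = NormalDist().cdf(z)   # left-tailed test

alpha = 0.05
print(f"sigma_p' = {sigma_p:.4f}, z = {z:.4f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```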
Exercise: 
Problem: At a pre-conceived α = 0.05, what is your:


a. Decision:
b. Reason for the decision:
c. Conclusion (write out in a complete sentence):


Solution:


a. Do not reject the null hypothesis


Discussion Question


Exercise: 
Problem: 
Does it appear that the proportion of people in that town with 


depression or a depressive illness is lower than that of the general adult
American population? Why or why not?


Homework 

This module provides a homework of Hypothesis Testing of Single Mean 
and Single Proportion as a part of Collaborative Statistics collection 
(col10522) by Barbara Illowsky and Susan Dean. 

Exercise: 


Problem: 


Some of the statements below refer to the null hypothesis, some to the 
alternate hypothesis. 


State the null hypothesis, $H_0$, and the alternative hypothesis, $H_a$, in
terms of the appropriate parameter (μ or p).


a. The mean number of years Americans work before retiring is 34.
b. At most 60% of Americans vote in presidential elections.
c. The mean starting salary for San Jose State University graduates
is at least $100,000 per year.
d. 29% of high school seniors get drunk each month.
e. Fewer than 5% of adults ride the bus to work in Los Angeles.
f. The mean number of cars a person owns in her lifetime is not
more than 10.
g. About half of Americans prefer to live away from cities, given
the choice.
h. Europeans have a mean paid vacation each year of six weeks.
i. The chance of developing breast cancer is under 11% for women.
j. Private universities' mean tuition cost is more than $20,000 per
year.


Solution: 


a. $H_0$: μ = 34; $H_a$: μ ≠ 34

c. $H_0$: μ ≥ 100,000; $H_a$: μ < 100,000
d. $H_0$: p = 0.29; $H_a$: p ≠ 0.29

g. $H_0$: p = 0.50; $H_a$: p ≠ 0.50

i. $H_0$: p ≥ 0.11; $H_a$: p < 0.11


Exercise: 


Problem: 


For (a) - (j) above, state the Type I and Type II errors in complete 
sentences. 


Solution: 


a. Type I error: We conclude that the mean is not 34 years, when it
really is 34 years. Type II error: We do not conclude that the mean
is not 34 years, when it is not really 34 years.

c. Type I error: We conclude that the mean is less than $100,000,
when it really is at least $100,000. Type II error: We do not
conclude that the mean is less than $100,000, when it is really
less than $100,000.

d. Type I error: We conclude that the proportion of h.s. seniors who
get drunk each month is not 29%, when it really is 29%. Type II
error: We do not conclude that the proportion of h.s. seniors that
get drunk each month is not 29%, when it is really not 29%.

i. Type I error: We conclude that the proportion is less than 11%,
when it is really at least 11%. Type II error: We do not conclude
that the proportion is less than 11%, when it really is less than
11%.


Exercise: 


Problem: For (a) - (j) above, in complete sentences: 


a. State a consequence of committing a Type I error.
b. State a consequence of committing a Type II error.


Note: For each of the word problems, use a solution sheet to do the
hypothesis test. The solution sheet is found in 14. Appendix (online book 


version: the link is "Solution Sheets"; PDF book version: look under 14.5 
Solution Sheets). Please feel free to make copies of the solution sheets. For 
the online version of the book, it is suggested that you copy the .doc or the 
.pdf files. 


Note: If you are using a student's-t distribution for a homework problem
below, you may assume that the underlying population is normally 
distributed. (In general, you must first prove that assumption, though.) 


Exercise: 


Problem: 


A particular brand of tires claims that its deluxe tire averages at least 
50,000 miles before it needs to be replaced. From past studies of this 
tire, the standard deviation is known to be 8000. A survey of owners of 
that tire design is conducted. From the 28 tires surveyed, the mean 
lifespan was 46,500 miles with a standard deviation of 9800 miles. Do 
the data support the claim at the 5% level? 


Exercise: 


Problem: 


From generation to generation, the mean age when smokers first start 
to smoke varies. However, the standard deviation of that age remains 
constant of around 2.1 years. A survey of 40 smokers of this 
generation was done to see if the mean starting age is at least 19. The 
sample mean was 18.1 with a sample standard deviation of 1.3. Do the 
data support the claim at the 5% level? 


Solution: 


e. z = -2.71
f. 0.0034
h. Decision: Reject null; Conclusion: μ < 19
i. (17.449, 18.757)


Exercise: 


Problem: 


The cost of a daily newspaper varies from city to city. However, the 
variation among prices remains steady with a standard deviation of 
20¢. A study was done to test the claim that the mean cost of a daily 
newspaper is $1.00. Twelve costs yield a mean cost of 95¢ with a 
standard deviation of 18¢. Do the data support the claim at the 1% 
level? 


Exercise: 


Problem: 


An article in the San Jose Mercury News stated that students in the 
California state university system take 4.5 years, on average, to finish 
their undergraduate degrees. Suppose you believe that the mean time is 
longer. You conduct a survey of 49 students and obtain a sample mean 
of 5.1 with a sample standard deviation of 1.2. Do the data support 
your claim at the 1% level? 


Solution: 


e. 3.5
f. 0.0005
h. Decision: Reject null; Conclusion: μ > 4.5
i. (4.7553, 5.4447)


Exercise: 


Problem: 


The mean number of sick days an employee takes per year is believed 
to be about 10. Members of a personnel department do not believe this 
figure. They randomly survey 8 employees. The number of sick days 
they took for the past year are as follows: 12; 4; 15; 3; 11; 8; 6; 8. Let 
x = the number of sick days they took for the past year. Should the 
personnel team believe that the mean number is about 10? 


Exercise: 


Problem: 


In 1955, Life Magazine reported that the 25 year-old mother of three 
worked, on average, an 80 hour week. Recently, many groups have 
been studying whether or not the women's movement has, in fact, 
resulted in an increase in the average work week for women 
(combining employment and at-home work). Suppose a study was 
done to determine if the mean work week has increased. 81 women 
were surveyed with the following results. The sample mean was 83; 
the sample standard deviation was 10. Does it appear that the mean 
work week has increased for women at the 5% level? 


Solution: 


e. 2.7
f. 0.0042
h. Decision: Reject null
i. (80.789, 85.211)


Exercise: 


Problem: 


Your statistics instructor claims that 60 percent of the students who 
take her Elementary Statistics class go through life feeling more 
enriched. For some reason that she can't quite figure out, most people 
don't believe her. You decide to check this out on your own. You 
randomly survey 64 of her past Elementary Statistics students and find 
that 34 feel more enriched as a result of her class. Now, what do you 
think? 


Exercise: 


Problem: 


A Nissan Motor Corporation advertisement read, “The average man’s 
I.Q. is 107. The average brown trout’s I.Q. is 4. So why can’t man 
catch brown trout?” Suppose you believe that the brown trout’s mean 
I.Q. is greater than 4. You catch 12 brown trout. A fish psychologist 
determines the I.Q.s as follows: 5; 4; 7; 3; 6; 4; 5; 3; 6; 3; 8; 5. 
Conduct a hypothesis test of your belief. 


Solution: 


d. $t_{11}$
e. 1.96
f. 0.0380
h. Decision: Reject null when α = 0.05; do not reject null when
α = 0.01
i. (3.8865, 5.9468)


Exercise: 
Problem: 
Refer to the previous problem. Conduct a hypothesis test to see if your 


decision and conclusion would change if your belief were that the 
brown trout’s mean I.Q. is not 4. 


Exercise: 


Problem: 


According to an article in Newsweek, the natural ratio of girls to boys 
is 100:105. In China, the birth ratio is 100: 114 (46.7% girls). Suppose 
you don’t believe the reported figures of the percent of girls born in 
China. You conduct a study. In this study, you count the number of 
girls and boys born in 150 randomly chosen recent births. There are 60 
girls and 90 boys born of the 150. Based on your study, do you believe 
that the percent of girls born in China is 46.7? 


Solution: 


e. -1.64
f. 0.1000
h. Decision: Do not reject null
i. (0.3216, 0.4784)
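
For a single proportion, the test statistic and p-value use the normal approximation. The sketch below assumes a two-tailed alternative (p ≠ 0.467), which is consistent with the p-value in part f; `one_proportion_z_test` and its `alternative` parameter are hypothetical names chosen for illustration.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0, alternative="two-sided"):
    """z statistic and p-value for H0: p = p0 (normal approximation)."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    cdf = NormalDist().cdf(z)
    if alternative == "less":
        p_value = cdf
    elif alternative == "greater":
        p_value = 1 - cdf
    else:
        p_value = 2 * min(cdf, 1 - cdf)
    return z, p_value

# 60 girls among 150 births; null proportion 0.467.
z, p_value = one_proportion_z_test(60, 150, 0.467)
print(round(z, 2), round(p_value, 4))
```

Passing `alternative="less"` or `alternative="greater"` gives the one-tailed versions used in several of the other exercises.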


Exercise: 


Problem: 


A poll done for Newsweek found that 13% of Americans have seen or 
sensed the presence of an angel. A contingent doubts that the percent is 
really that high. It conducts its own survey. Out of 76 Americans 
surveyed, only 2 had seen or sensed the presence of an angel. As a 
result of the contingent’s survey, would you agree with the Newsweek 
poll? In complete sentences, also give three reasons why the two polls 
might give different results. 


Exercise: 


Problem: 


The mean work week for engineers in a start-up company is believed 
to be about 60 hours. A newly hired engineer hopes that it’s shorter. 
She asks 10 engineering friends in start-ups for the lengths of their 
mean work weeks. Based on the results that follow, should she count 
on the mean work week to be shorter than 60 hours? 


Data (length of mean work week): 70; 45; 55; 60; 65; 55; 55; 60; 50; 55


Solution: 


d. t₉
e. -1.33
f. 0.1086
h. Decision: Do not reject null
i. (51.886, 62.114)


Exercise: 
Problem: 
Use the “Lap time” data for Lap 4 (see Table of Contents) to test the
claim that Terri finishes Lap 4, on average, in less than 129 seconds.
Use all twenty races given.


Exercise: 
Problem: 
Use the “Initial Public Offering” data (see Table of Contents) to test
the claim that the mean offer price was $18 per share. Do not use all
the data. Use your random number generator to randomly survey 15
prices.


Note: The following questions were written by past students. They are
excellent problems! 


Exercise: 


Problem: "Asian Family Reunion" by Chau Nguyen


Every two years it comes around 

We all get together from different towns. 
In my honest opinion 

It's not a typical family reunion 

Not forty, or fifty, or sixty, 

But how about seventy companions! 

The kids would play, scream, and shout 
One minute they're happy, another they'll pout. 
The teenagers would look, stare, and compare 
From how they look to what they wear. 

The men would chat about their business 

That they make more, but never less. 

Money is always their subject 

And there's always talk of more new projects. 
The women get tired from all of the chats 
They head to the kitchen to set out the mats. 
Some would sit and some would stand 

Eating and talking with plates in their hands. 
Then come the games and the songs 

And suddenly, everyone gets along! 

With all that laughter, it's sad to say 

That it always ends in the same old way. 

They hug and kiss and say "good-bye" 

And then they all begin to cry! 

I say that 60 percent shed their tears 

But my mom counted 35 people this year. 

She said that boys and men will always have 
their pride, 

So we won't ever see them cry. 

I myself don't think she's correct, 

So could you please try this problem to see if 
you object? 


Exercise: 


Problem: "The Problem with Angels" by Cyndy Dowling


Although this problem is wholly mine, 

The catalyst came from the magazine, Time. 
On the magazine cover I did find 

The realm of angels tickling my mind. 


Inside, 69% I found to be 
In angels, Americans do believe. 


Then, it was time to rise to the task, 
Ninety-five high school and college students I 
did ask. 

Viewing all as one group, 

Random sampling to get the scoop. 


So, I asked each to be true, 
"Do you believe in angels?" Tell me, do! 


Hypothesizing at the start, 
Totally believing in my heart 
That the proportion who said yes 
Would be equal on this test. 


Lo and behold, seventy-three did arrive, 
Out of the sample of ninety-five. 

Now your job has just begun, 

Solve this problem and have some fun. 


Solution: 


e. 1.65
f. 0.0984
h. Decision: Do not reject null
i. (0.6836, 0.8533)


Exercise: 


Problem: "Blowing Bubbles" by Sondra Prull


Studying stats just made me tense, 
I had to find some sane defense. 
Some light and lifting simple play 
To float my math anxiety away. 


Blowing bubbles lifts me high 

Takes my troubles to the sky. 

POIK! They're gone, with all my stress 
Bubble therapy is the best. 


The label said each time I blew 

The average number of bubbles would be at least 
22. 

I blew and blew and this I found 

From 64 blows, they all are round! 


But the number of bubbles in 64 blows 
Varied widely, this I know. 

20 per blow became the mean 

They deviated by 6, and not 16. 


From counting bubbles, I sure did relax 
But now I give to you your task. 

Was 22 a reasonable guess? 

Find the answer and pass this test! 


Exercise: 


Problem: "Dalmatian Darnation" by Kathy Sparling


A greedy dog breeder named Spreckles 
Bred puppies with numerous freckles 
The Dalmatians he sought 


Possessed spot upon spot 
The more spots, he thought, the more shekels. 


His competitors did not agree 

That freckles would increase the fee. 
They said, “Spots are quite nice 

But they don't affect price; 

One should breed for improved pedigree.” 


The breeders decided to prove 
This strategy was a wrong move. 
Breeding only for spots 

Would wreak havoc, they thought. 
His theory they want to disprove. 


They proposed a contest to Spreckles 
Comparing dog prices to freckles. 

In records they looked up 

One hundred one pups: 

Dalmatians that fetched the most shekels. 


They asked Mr. Spreckles to name 

An average spot count he'd claim 

To bring in big bucks. 

Said Spreckles, “Well, shucks, 

It's for one hundred one that I aim.” 


Said an amateur statistician 

Who wanted to help with this mission. 
“Twenty-one for the sample 

Standard deviation's ample: 


They examined one hundred and one 
Dalmatians that fetched a good sum. 
They counted each spot, 

Mark, freckle and dot 

And tallied up every one. 


Instead of one hundred one spots 

They averaged ninety six dots 

Can they muzzle Spreckles’ 

Obsession with freckles 

Based on all the dog data they've got? 


Solution: 


e. -2.39
f. 0.0093
h. Decision: Reject null
i. (91.854, 100.15)


Exercise: 


Problem: 


"Macaroni and Cheese, please!!" by Nedda Misherghi and Rachelle 
Hall 


As a poor starving student I don't have much money to spend for even 
the bare necessities. So my favorite and main staple food is macaroni 
and cheese. It's high in taste and low in cost and nutritional value. 


One day, as I sat down to determine the meaning of life, I got a serious 
craving for this, oh, so important, food of my life. So I went down the 
street to Greatway to get a box of macaroni and cheese, but it was SO 
expensive! $2.02 !!! Can you believe it? It made me stop and think. 
The world is changing fast. I had thought that the mean cost of a box 
(the normal size, not some super-gigantic-family-value-pack) was at 
most $1, but now I wasn't so sure. However, I was determined to find 
out. I went to 53 of the closest grocery stores and surveyed the prices 
of macaroni and cheese. Here are the data I wrote in my notebook: 
Price per box of Mac and Cheese: 


• 5 stores @ $2.02
• 15 stores @ $0.25
• 3 stores @ $1.29
• 6 stores @ $0.35
• 4 stores @ $2.27
• 7 stores @ $1.50
• 5 stores @ $1.89
• 8 stores @ $0.75


I could see that the costs varied but I had to sit down to figure out 
whether or not I was right. If it does turn out that this mouth-watering 
dish is at most $1, then I'll throw a big cheesy party in our next 
statistics lab, with enough macaroni and cheese for just me. (After all, 
as a poor starving student I can't be expected to feed our class of 
animals!) 


Exercise: 
Problem: 
"William Shakespeare: The Tragedy of Hamlet, Prince of Denmark"
by Jacqueline Ghodsi

THE CHARACTERS (in order of appearance):

• HAMLET, Prince of Denmark and student of Statistics
• POLONIUS, Hamlet’s tutor
• HORATIO, friend to Hamlet and fellow student


Scene: The great library of the castle, in which Hamlet does his lessons 
Act I 


(The day is fair, but the face of Hamlet is clouded. He paces the large 
room. His tutor, Polonius, is reprimanding Hamlet regarding the 
latter’s recent experience. Horatio is seated at the large table at right 
stage.) 


POLONIUS: My Lord, how cans’t thou admit that thou hast seen a 
ghost! It is but a figment of your imagination! 


HAMLET: I beg to differ; I know of a certainty that five-and-seventy 
in one hundred of us, condemned to the whips and scorns of time as 
we are, have gazed upon a spirit of health, or goblin damn’d, be their 
intents wicked or charitable. 


POLONIUS: If thou doest insist upon thy wretched vision then let me
invest your time; be true to thy work and speak to me through the
reason of the null and alternate hypotheses. (He turns to Horatio.) Did
not Hamlet himself say, “What piece of work is man, how noble in
reason, how infinite in faculties?” Then let not this foolishness persist.
Go, Horatio, make a survey of three-and-sixty and discover what the
true proportion be. For my part, I will never succumb to this fantasy,
but deem man to be devoid of all reason should thy proposal of at least
five-and-seventy in one hundred hold true.


HORATIO (to Hamlet): What should we do, my Lord? 
HAMLET: Go to thy purpose, Horatio. 
HORATIO: To what end, my Lord? 


HAMLET: That you must teach me. But let me conjure you by the
rights of our fellowship, by the consonance of our youth, by the
obligation of our ever-preserved love, be even and direct with me,
whether I am right or no.


(Horatio exits, followed by Polonius, leaving Hamlet to ponder alone.) 
Act II


(The next day, Hamlet awaits anxiously the presence of his friend, 
Horatio. Polonius enters and places some books upon the table just a 
moment before Horatio enters.) 


POLONIUS: So, Horatio, what is it thou didst reveal through thy 
deliberations? 


HORATIO: In a random survey, for which purpose thou thyself sent 
me forth, I did discover that one-and-forty believe fervently that the 
spirits of the dead walk with us. Before my God, I might not this 
believe, without the sensible and true avouch of mine own eyes. 


POLONIUS: Give thine own thoughts no tongue, Horatio. (Polonius 
turns to Hamlet.) But look to’t I charge you, my Lord. Come Horatio, 
let us go together, for this is not our test. (Horatio and Polonius leave 
together.) 


HAMLET: To reject, or not reject, that is the question: whether ‘tis 
nobler in the mind to suffer the slings and arrows of outrageous 
Statistics, or to take arms against a sea of data, and, by opposing, end 
them. (Hamlet resignedly attends to his task.) 


(Curtain falls) 


Solution: 


e. -1.82
f. 0.0345
h. Decision: Do not reject null
i. (0.5331, 0.7685)


Exercise: 


Problem:"Untitled" by Stephen Chen 


I've often wondered how software is released and sold to the public. 
Ironically, I work for a company that sells products with known 
problems. Unfortunately, most of the problems are difficult to create, 
which makes them difficult to fix. I usually use the test program X, 
which tests the product, to try to create a specific problem. When the 
test program is run to make an error occur, the likelihood of generating 
an error is 1%. 


So, armed with this knowledge, I wrote a new test program Y that will 
generate the same error that test program X creates, but more often. To 
find out if my test program is better than the original, so that I can 
convince the management that I'm right, I ran my test program to find 
out how often I can generate the same error. When I ran my test 
program 50 times, I generated the error twice. While this may not 
seem much better, I think that I can convince the management to use 
my test program instead of the original test program. Am I right? 


Exercise: 


Problem: Japanese Girls’ Names 
by Kumi Furuichi 


It used to be very typical for Japanese girls’ names to end with “ko.” 
(The trend might have started around my grandmothers’ generation 
and its peak might have been around my mother’s generation.) “Ko” 
means “child” in Chinese characters. Parents would name their
daughters by attaching “ko” to other Chinese characters whose
meanings express what they want their daughters to become, such as
Sachiko (a happy child), Yoshiko (a good child), Yasuko (a healthy
child), and so on.


However, I noticed recently that only two out of nine of my Japanese 
girlfriends at this school have names which end with “ko.” More and 
more, parents seem to have become creative, modernized, and, 
sometimes, westernized in naming their children. 


I have a feeling that, while 70 percent or more of my mother’s 
generation would have names with “ko” at the end, the proportion has 
dropped among my peers. I wrote down all my Japanese friends’,
ex-classmates’, co-workers’, and acquaintances’ names that I could
remember. Below are the names. (Some are repeats.) Test to see if the
proportion has dropped for this generation.


Ai, Akemi, Akiko, Ayumi, Chiaki, Chie, Eiko, Eri, Eriko, Fumiko, 
Harumi, Hitomi, Hiroko, Hiroko, Hidemi, Hisako, Hinako, Izumi,
Izumi, Junko, Junko, Kana, Kanako, Kanayo, Kayo, Kayoko, Kazumi,
Keiko, Keiko, Kei, Kumi, Kumiko, Kyoko, Kyoko, Madoka, Maho, 
Mai, Maiko, Maki, Miki, Miki, Mikiko, Mina, Minako, Miyako, 
Momoko, Nana, Naoko, Naoko, Naoko, Noriko, Rieko, Rika, Rika, 
Rumiko, Rei, Reiko, Reiko, Sachiko, Sachiko, Sachiyo, Saki, Sayaka, 
Sayoko, Sayuri, Seiko, Shiho, Shizuka, Sumiko, Takako, Takako, 
Tomoe, Tomoe, Tomoko, Touko, Yasuko, Yasuko, Yasuyo, Yoko, 
Yoko, Yoko, Yoshiko, Yoshiko, Yoshiko, Yuka, Yuki, Yuki, Yukiko, 
Yuko, Yuko. 


Solution: 


e. z = -2.99
f. 0.0014
h. Decision: Reject null; Conclusion: p < 0.70
i. (0.4529, 0.6582)
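
The interval in part i can be reproduced with the usual normal-approximation formula for a proportion. This sketch assumes a tally of 50 names ending in “ko” out of the 90 names listed (a count made from the list above, not stated in the text) and a 95% confidence level; `proportion_confidence_interval` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Assumed tally: 50 of the 90 listed names end in "ko".
low, high = proportion_confidence_interval(50, 90)
print(round(low, 4), round(high, 4))
```

Because the whole interval lies below 0.70, it agrees with the decision to reject the null hypothesis.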


Exercise: 


Problem: Phillip’s Wish by Suzanne Osorio 


My nephew likes to play 

Chasing the girls makes his day. 

He asked his mother 

If it is okay 

To get his ear pierced. 

She said, “No way!” 

To poke a hole through your ear, 

Is not what I want for you, dear. 

He argued his point quite well, 

Says even my macho pal, Mel, 

Has gotten this done. 

It’s all just for fun. 

C’mon please, mom, please, what the hell. 
Again Phillip complained to his mother, 
Saying half his friends (including their 
brothers) 


Are piercing their ears 

And they have no fears 

He wants to be like the others. 

She said, “I think it’s much less. 

We must do a hypothesis test. 

And if you are right, 

I won’t put up a fight. 

But, if not, then my case will rest.” 
We proceeded to call fifty guys 

To see whose prediction would fly. 
Nineteen of the fifty 

Said piercing was nifty 

And earrings they’d occasionally buy. 
Then there’s the other thirty-one, 
Who said they’d never have this done. 
So now this poem’s finished. 

Will his hopes be diminished, 

Or will my nephew have his fun? 


Exercise: 


Problem: The Craven by Mark Salangsang 


Once upon a morning dreary 

In stats class I was weak and weary. 
Pondering over last night’s homework 
Whose answers were now on the board 

This I did and nothing more. 


While I nodded nearly napping 
Suddenly, there came a tapping. 

As someone gently rapping,

Rapping my head as I snore. 

Quoth the teacher, “Sleep no more.” 


“In every class you fall asleep,” 
The teacher said, his voice was deep. 


“So a tally I’ve begun to keep 
Of every class you nap and snore. 
The percentage being forty-four.” 


“My dear teacher I must confess, 

While sleeping is what I do best. 

The percentage, I think, must be less, 
A percentage less than forty-four.” 
This I said and nothing more. 


“We'll see,” he said and walked away, 
And fifty classes from that day 
He counted till the month of May 
The classes in which I napped and snored. 
The number he found was twenty-four. 


At a Significance level of 0.05, 
Please tell me am I still alive? 
Or did my grade just take a dive 
Plunging down beneath the floor? 
Upon thee I hereby implore. 


Solution: 


e. 0.57
f. 0.7156
h. Decision: Do not reject null
i. (0.3415, 0.6185)


Exercise: 


Problem: 


Toastmasters International cites a report by Gallup Poll that 40% of
Americans fear public speaking. A student believes that less than 40% 
of students at her school fear public speaking. She randomly surveys 
361 schoolmates and finds that 135 report they fear public speaking. 
Conduct a hypothesis test to determine if the percent at her school is 
less than 40%. (Source: http://toastmasters.org/artisan/detail.asp?


Exercise: 


Problem: 


68% of online courses taught at community colleges nationwide were 
taught by full-time faculty. To test if 68% also represents California’s 
percent for full-time faculty teaching the online classes, Long Beach 
City College (LBCC), CA, was randomly selected for comparison. In 
the same year, 34 of the 44 online courses LBCC offered were taught 
by full-time faculty. Conduct a hypothesis test to determine if 68% 
represents CA. NOTE: For more accurate results, use more CA 
community colleges and this past year's data. (Sources: Growing by 
Degrees by Allen and Seaman; Amit Schitai, Director of Instructional 
Technology and Distance Learning, LBCC). 


Solution: 


e. 1.32
f. 0.1873
h. Decision: Do not reject null
i. (0.65, 0.90)


Exercise: 


Problem: 


According to an article in Bloomberg Businessweek, New York City's 
most recent adult smoking rate is 14%. Suppose that a survey is 
conducted to determine this year’s rate. Nine out of 70 randomly 
chosen N.Y. City residents reply that they smoke. Conduct a 
hypothesis test to determine if the rate is still 14% or if it has 
decreased. (Source: http://www.businessweek.com/news/2011-09-15/nyc-smoking-rate-falls-to-record-low-of-14-bloomberg-says.html)


Exercise: 


Problem: 


The mean age of De Anza College students in a previous term was 26.6 
years old. An instructor thinks the mean age for online students is older 
than 26.6. She randomly surveys 56 online students and finds that the 
sample mean is 29.4 with a standard deviation of 2.1. Conduct a 
hypothesis test. (Source: 


Solution: 


e. 9.98
f. 0.0000
h. Decision: Reject null
i. (28.8, 30.0)


Exercise: 


Problem: 


Registered nurses earned an average annual salary of $69,110. For that 
same year, a survey was conducted of 41 California registered nurses 
to determine if the annual salary is higher than $69,110 for California 
nurses. The sample average was $71,121 with a sample standard 
deviation of $7,489. Conduct a hypothesis test. (Source: 
http://www.bls.gov/oes/current/oes291111.htm) 


Exercise: 


Problem: 


La Leche League International reports that the mean age of weaning a 
child from breastfeeding is age 4 to 5 worldwide. In America, most 
nursing mothers wean their children much earlier. Suppose a random 
survey is conducted of 21 U.S. mothers who recently weaned their 
children. The mean weaning age was 9 months (3/4 year) with a 
standard deviation of 4 months. Conduct a hypothesis test to determine 
if the mean weaning age in the U.S. is less than 4 years old. (Source: 
http://www.lalecheleague.org/Law/BAFeb01.html)


Solution: 


e. -44.7
f. 0.0000
h. Decision: Reject null
i. (0.60, 0.90), in years


Try these multiple choice questions. 


Exercise: 


Problem: 


When a new drug is created, the pharmaceutical company must subject 
it to testing before receiving the necessary permission from the Food 
and Drug Administration (FDA) to market the drug. Suppose the null 
hypothesis is “the drug is unsafe.” What is the Type II Error? 


A. To conclude the drug is safe when, in fact, it is unsafe.
B. To not conclude the drug is safe when, in fact, it is safe.
C. To conclude the drug is safe when, in fact, it is safe.
D. To not conclude the drug is unsafe when, in fact, it is unsafe.


Solution: 


B 


The next two questions refer to the following information: Over the past 
few decades, public health officials have examined the link between weight 
concerns and teen girls smoking. Researchers surveyed a group of 273 
randomly selected teen girls living in Massachusetts (between 12 and 15 
years old). After four years the girls were surveyed again. Sixty-three (63) 
said they smoked to stay thin. Is there good evidence that more than thirty 
percent of the teen girls smoke to stay thin? 

Exercise: 


Problem: The alternate hypothesis is

A. p < 0.30
B. p ≤ 0.30
C. p ≥ 0.30
D. p > 0.30


Solution: 


D 


Exercise: 


Problem: After conducting the test, your decision and conclusion are 


A. Reject H₀: There is sufficient evidence to conclude that more
than 30% of teen girls smoke to stay thin.
B. Do not reject H₀: There is not sufficient evidence to conclude
that less than 30% of teen girls smoke to stay thin.
C. Do not reject H₀: There is not sufficient evidence to conclude
that more than 30% of teen girls smoke to stay thin.
D. Reject H₀: There is sufficient evidence to conclude that less
than 30% of teen girls smoke to stay thin.


Solution: 


C 


The next three questions refer to the following information: A statistics 
instructor believes that fewer than 20% of Evergreen Valley College (EVC) 
students attended the opening night midnight showing of the latest Harry 
Potter movie. She surveys 84 of her students and finds that 11 of them
attended the midnight showing.

Exercise: 


Problem: An appropriate alternative hypothesis is 


A. p = 0.20
B. p > 0.20
C. p < 0.20
D. p ≤ 0.20


Solution: 


C 


Exercise: 


Problem: At a 1% level of significance, an appropriate conclusion is: 


A. There is insufficient evidence to conclude that the percent of
EVC students that attended the midnight showing of Harry Potter
is less than 20%.
B. There is sufficient evidence to conclude that the percent of EVC
students that attended the midnight showing of Harry Potter is
more than 20%.
C. There is sufficient evidence to conclude that the percent of EVC
students that attended the midnight showing of Harry Potter is
less than 20%.
D. There is insufficient evidence to conclude that the percent of
EVC students that attended the midnight showing of Harry Potter
is at least 20%.


Solution: 


A 
Exercise: 


Problem: 


The Type I error is to conclude that the percent of EVC students who 
attended is 


A. at least 20%, when, in fact, it is less than 20%.
B. 20%, when, in fact, it is 20%.
C. less than 20%, when, in fact, it is at least 20%.
D. less than 20%, when, in fact, it is less than 20%.


Solution: 


C 


The next two questions refer to the following information: 


It is believed that Lake Tahoe Community College (LTCC) Intermediate 
Algebra students get less than 7 hours of sleep per night, on average. A 
survey of 22 LTCC Intermediate Algebra students generated a mean of 7.24 
hours with a standard deviation of 1.93 hours. At a level of significance of 
5%, do LTCC Intermediate Algebra students get less than 7 hours of sleep 
per night, on average? 

Exercise: 


Problem: The distribution to be used for this test is X̄ ~

A. N(7.24, 1.93/√22)
B. N(7.24, 1.93)
C. t₂₂
D. t₂₁
Solution: 
D 
Exercise: 
Problem: 


The Type II error is to not reject that the mean number of hours of 
sleep LTCC students get per night is at least 7 when, in fact, the mean 
number of hours 


A. is more than 7 hours.
B. is at most 7 hours.
C. is at least 7 hours.
D. is less than 7 hours.


Solution: 


D 


The next three questions refer to the following information: Previously, 
an organization reported that teenagers spent 4.5 hours per week, on 
average, on the phone. The organization thinks that, currently, the mean is 
higher. Fifteen (15) randomly chosen teenagers were asked how many hours 
per week they spend on the phone. The sample mean was 4.75 hours with a 
sample standard deviation of 2.0. Conduct a hypothesis test. 

Exercise: 


Problem: The null and alternate hypotheses are:

A. H₀: x̄ = 4.5, Hₐ: x̄ > 4.5
B. H₀: μ ≥ 4.5, Hₐ: μ < 4.5
C. H₀: μ = 4.75, Hₐ: μ > 4.75
D. H₀: μ = 4.5, Hₐ: μ > 4.5


Solution: 


D 


Exercise: 


Problem: 


At a significance level of α = 0.05, what is the correct conclusion?


A. There is enough evidence to conclude that the mean number of
hours is more than 4.75.
B. There is enough evidence to conclude that the mean number of
hours is more than 4.5.
C. There is not enough evidence to conclude that the mean number
of hours is more than 4.5.
D. There is not enough evidence to conclude that the mean number
of hours is more than 4.75.


Solution: 


C 


Exercise: 


Problem: The Type I error is:

A. To conclude that the current mean hours per week is higher than
4.5, when, in fact, it is higher.
B. To conclude that the current mean hours per week is higher than
4.5, when, in fact, it is the same.
C. To conclude that the mean hours per week currently is 4.5,
when, in fact, it is higher.
D. To conclude that the mean hours per week currently is no higher
than 4.5, when, in fact, it is not higher.


Solution: 


B 


Review 

This module provides an overview of Hypothesis Testing of Single Mean 
and Single Proportion as a part of Collaborative Statistics collection 
(col10522) by Barbara Illowsky and Susan Dean. 

Exercise: 


Problem: 
Rebecca and Matt are 14-year-old twins. Matt’s height is 2 standard
deviations below the mean for 14-year-old boys’ height. Rebecca’s
height is 0.10 standard deviations above the mean for 14-year-old girls’
height. Interpret this.


A. Matt is 2.1 inches shorter than Rebecca.
B. Rebecca is very tall compared to other 14-year-old girls.
C. Rebecca is taller than Matt.
D. Matt is shorter than the average 14-year-old boy.


Solution: 


D 
Exercise: 


Problem: 


Construct a histogram of the IPO data (see Table of Contents, 14. 
Appendix, Data Sets). Use 5 intervals. 


Solution: 


No solution provided. There are several ways in which the histogram 
could be constructed. 


The next three exercises refer to the following information: Ninety 
homeowners were asked the number of estimates they obtained before 
having their homes fumigated. X = the number of estimates. 


x    Rel. Freq.    Cumulative Rel. Freq.
1    0.3
2    0.2
4    0.4
5    0.1


Complete the cumulative relative frequency column. 
Exercise: 


Problem: 


Calculate the sample mean (a), the sample standard deviation (b) and 
the percent of the estimates that fall at or below 4 (c). 


Solution: 


a. 2.8
b. 1.48
c. 90%
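
These three answers follow from the relative frequency table. The sketch below assumes the four x values are 1, 2, 4 and 5 (the last one is read as 5, the value consistent with the stated mean of 2.8) and rebuilds the counts from n = 90 so that the sample standard deviation divides by n - 1.

```python
import math
from itertools import accumulate

values = [1, 2, 4, 5]            # number of estimates (last value assumed to be 5)
rel_freq = [0.3, 0.2, 0.4, 0.1]  # relative frequencies from the table
n = 90                           # homeowners surveyed

# Running totals fill in the cumulative relative frequency column.
cumulative = list(accumulate(rel_freq))

mean = sum(x * f for x, f in zip(values, rel_freq))

# Sample SD: weight each squared deviation by its count n * f, divide by n - 1.
sum_sq = sum(n * f * (x - mean) ** 2 for x, f in zip(values, rel_freq))
sample_sd = math.sqrt(sum_sq / (n - 1))

at_or_below_4 = sum(f for x, f in zip(values, rel_freq) if x <= 4)

print(round(mean, 2), round(sample_sd, 2), round(at_or_below_4, 2))
```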


Exercise: 
Problem: 


Calculate the median, M, the first quartile, Q1, the third quartile, Q3. 
Then construct a boxplot of the data. 


Solution: 


M = 3; Q1 = 1; Q3 = 4


Exercise: 


Problem: The middle 50% of the data are between _____ and _____.


Solution: 


1 and 4 


The next three questions refer to the following table: Seventy 5th and 
6th graders were asked their favorite dinner. 


              Pizza    Hamburgers    Spaghetti    Fried shrimp
5th grader    15       6             9            0
6th grader    15       7             10           8
Exercise: 
Problem: 


Find the probability that one randomly chosen child is in the 6th grade 
and prefers fried shrimp. 


Solution: 


D 


Exercise: 


Problem: Find the probability that a child does not prefer pizza. 


e A 2 


Solution: 


C 
Exercise: 


Problem: 


Find the probability a child is in the 5th grade given that the child 
prefers spaghetti. 


eA 
° Ba 
eae 
°Da 


Solution: 


A 
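
The three probabilities asked for above can be computed directly from the table counts. This is a sketch that assumes the garbled first row label is “5th grader”; exact fractions are used so that each result can be compared with whichever answer choice it corresponds to.

```python
from fractions import Fraction

# Counts from the favorite-dinner table (70 children in total).
table = {
    "5th": {"pizza": 15, "hamburgers": 6, "spaghetti": 9, "fried shrimp": 0},
    "6th": {"pizza": 15, "hamburgers": 7, "spaghetti": 10, "fried shrimp": 8},
}
total = sum(sum(row.values()) for row in table.values())

# P(6th grade AND fried shrimp): one cell divided by the grand total.
p_6th_and_shrimp = Fraction(table["6th"]["fried shrimp"], total)

# P(not pizza): complement of the pizza column total.
pizza_total = sum(row["pizza"] for row in table.values())
p_not_pizza = 1 - Fraction(pizza_total, total)

# P(5th grade | spaghetti): restrict the sample space to the spaghetti column.
spaghetti_total = sum(row["spaghetti"] for row in table.values())
p_5th_given_spaghetti = Fraction(table["5th"]["spaghetti"], spaghetti_total)

print(p_6th_and_shrimp, p_not_pizza, p_5th_given_spaghetti)
```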


Exercise: 


Problem: A sample of convenience is a random sample. 


A. true
B. false


Solution: 


B 


Exercise: 


Problem: A statistic is a number that is a property of the population. 


A. true
B. false


Solution: 


B 


Exercise: 


Problem: You should always throw out any data that are outliers. 


A. true
B. false


Solution: 


B 
Exercise: 


Problem: 


Lee bakes pies for a small restaurant in Felton, CA. She generally 
bakes 20 pies in a day, on the average. Of interest is the number of
pies she bakes each day.

a. Define the random variable X.
b. State the distribution for X.
c. Find the probability that Lee bakes more than 25 pies in any
given day.


Solution: 


b. P(20)
c. 0.1122
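
Part c is an upper-tail Poisson probability with mean 20. A minimal sketch, assuming X counts pies baked per day and using only the Poisson probability formula; `poisson_upper_tail` is a hypothetical helper name.

```python
import math

def poisson_upper_tail(k, mean):
    """P(X > k) for X ~ Poisson(mean), as 1 minus the sum of P(X = 0..k)."""
    cdf = sum(math.exp(-mean) * mean ** i / math.factorial(i)
              for i in range(k + 1))
    return 1 - cdf

# More than 25 pies in a day when the average is 20 per day.
p_more_than_25 = poisson_upper_tail(25, 20)
print(round(p_more_than_25, 4))
```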


Exercise: 


Problem: 


Six different brands of Italian salad dressing were randomly selected at 
a supermarket. The grams of fat per serving are 7, 7, 9, 6, 8, 5. Assume 
that the underlying distribution is normal. Calculate a 95% confidence 
interval for the population mean grams of fat per serving of Italian 
salad dressing sold in supermarkets. 


Solution: 


CI: (5.52,8.48) 
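
The interval can be checked from the six data values. In this sketch the critical value 2.571 (Student's t, 5 degrees of freedom, 0.025 in each tail) is taken from a standard t table and hard-coded; that constant is an assumption of the example rather than something computed here.

```python
import math
import statistics

fat_grams = [7, 7, 9, 6, 8, 5]
n = len(fat_grams)
mean = statistics.mean(fat_grams)
sd = statistics.stdev(fat_grams)   # sample standard deviation (divides by n - 1)

# Assumed table value: t critical for 95% confidence with n - 1 = 5 df.
T_CRIT_95_DF5 = 2.571

margin = T_CRIT_95_DF5 * sd / math.sqrt(n)
low, high = mean - margin, mean + margin
print(round(low, 2), round(high, 2))
```
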
Exercise: 


Problem: 


Given: uniform, exponential, normal distributions. Match each to a 
statement below. 


a. mean = median ≠ mode
b. mean > median > mode
c. mean = median = mode


Solution: 


a. uniform
b. exponential
c. normal


Hypothesis Testing of Single Mean and Single Proportion: Lab (Edited: 
Teegarden) 02 

update of figure 1 to be a normal curve 

Single Sample Hypothesis Test 


Name: 


I. Student Learning Outcomes: 


• The student will select the appropriate distributions to use in each case.
• The student will conduct hypothesis tests and interpret the results.


II. Textbook Survey


The data in Textbook.mtw lists the cost of 62 books required for a
sample of classes from Summer 2008 at Mesa College. Students believe
that they are spending on average $100 for their textbooks. Using the data
for new books as the sample, conduct a hypothesis test to determine if the
average cost of new textbooks at Mesa is lower than $100. (α = 0.08)


1. H₀: ________ Hₐ: ________
2. In words, define the random variable.
3. The distribution to use for the test is:
4. Using Minitab, draw the probability graph and label it appropriately.
   Attach the graph to this lab.
5. What is the formula for calculating the test statistic? Include the
   values.
6. Calculate the test statistic using Minitab and include the session
   window output. test statistic = ________
7. Determine the p-value using Minitab. p-value = ________
8. Do you or do you not reject the null hypothesis? Why? (Use 2 - 3
   complete sentences)
9. Write a clear conclusion using a complete sentence.


III. Language Survey


According to the 2000 Census, about 39.5% of Californians and 17.9% of 
all Americans speak a language other than English at home. Using students 
at your college as the sample, conduct a hypothesis test to determine if the 
percent of the students at your school that speak a language other than 
English at home is different from 39.5%. (α = 0.05) (Ensure that your
sample is large enough to allow for the assumption of normality.) 


Sample data: x = ________ n = ________

1. H₀: ________ Hₐ: ________
2. In words, define the random variable.
3. The distribution to use for the test is:
4. Using Minitab, draw the probability graph and label it appropriately.
   Attach the graph to this lab.
5. What is the formula for calculating the test statistic? Include the
   values.
6. Calculate the test statistic using Minitab and include the session
   window output. test statistic = ________
7. Determine the p-value using Minitab. p-value = ________
8. Do you or do you not reject the null hypothesis? Why? (Use 2 - 3
   complete sentences)
9. Write a clear conclusion using a complete sentence.


Hypothesis Testing: Two Population Means and Two Population 
Proportions 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Classify hypothesis tests by type.
• Conduct and interpret hypothesis tests for two population means,
  population standard deviations known.
• Conduct and interpret hypothesis tests for two population means,
  population standard deviations unknown.
• Conduct and interpret hypothesis tests for two population proportions.
• Conduct and interpret hypothesis tests for matched or paired samples.


Introduction 


Studies often compare two groups. For example, researchers are interested 
in the effect aspirin has in preventing heart attacks. Over the last few years, 
newspapers and magazines have reported about various aspirin studies 
involving two groups. Typically, one group is given aspirin and the other 
group is given a placebo. Then, the heart attack rate is studied over several 
years. 


There are other situations that deal with the comparison of two groups. For 
example, studies compare various diet and exercise programs. Politicians 
compare the proportion of individuals from different income brackets who 
might vote for them. Students are interested in whether SAT or GRE 
preparatory courses really help raise their scores. 


In the previous chapter, you learned to conduct hypothesis tests on single 
means and single proportions. You will expand upon that in this chapter. 
You will compare two means or two proportions to each other. The general 
procedure is still the same, just expanded. 


To compare two means or two proportions, you work with two groups. The 
groups are classified either as independent or matched pairs. 


Independent groups mean that the two samples taken are independent, that 
is, sample values selected from one population are not related in any way to 
sample values selected from the other population. Matched pairs consist of 
two samples that are dependent. The parameter tested using matched pairs 
is the population mean. The parameters tested using independent groups are 
either population means or population proportions. 


Note: This chapter relies on either a calculator or a computer to calculate
the degrees of freedom, the test statistics, and p-values. TI-83+ and TI-84 
instructions are included as well as the test statistic formulas. When using 
the TI-83+/TI-84 calculators, we do not need to separate two population 
means, independent groups, population variances unknown into large and 
small sample sizes. However, most statistical computer software has the 
ability to differentiate these tests. 


This chapter deals with the following hypothesis tests: 
Independent groups (samples are independent) 


• Test of two population means.
• Test of two population proportions.


Matched or paired samples (samples are dependent) 


• Becomes a test of one population mean.


Comparing Two Independent Population Means with Unknown Population 
Standard Deviations 

This module provides an overview of Comparing Two Independent 
Population Means with Unknown Population Standard Deviations as a part 
of Collaborative Statistics collection (col10522) by Barbara Illowsky and 
Susan Dean. 


1. The two independent samples are simple random samples from two 
distinct populations. 

2. Both populations are normally distributed with the population means 
and standard deviations unknown unless the sample sizes are greater 
than 30. In that case, the populations need not be normally distributed. 


Note: The test comparing two independent population means with
unknown and possibly unequal population standard deviations is called the
Aspin-Welch t-test. The degrees of freedom formula was developed by
Aspin and Welch.


The comparison of two population means is very common. A difference
between the two samples depends on both the means and the standard
deviations. Very different means can occur by chance if there is great
variation among the individual samples. In order to account for the
variation, we take the difference of the sample means, $\bar{X}_1 - \bar{X}_2$, and
divide by the standard error (shown below) in order to standardize the
difference. The result is a t-score test statistic (shown below).


Because we do not know the population standard deviations, we estimate
them using the two sample standard deviations from our independent
samples. For the hypothesis test, we calculate the estimated standard
deviation, or standard error, of the difference in sample means,
$\bar{X}_1 - \bar{X}_2$.


Equation:
The standard error is:

$$\sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}$$

The test statistic (t-score) is calculated as follows:


Equation:
t-score

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}}$$
where:


• $s_1$ and $s_2$, the sample standard deviations, are estimates of $\sigma_1$ and
$\sigma_2$, respectively.

• $\sigma_1$ and $\sigma_2$ are the unknown population standard deviations.

• $\bar{x}_1$ and $\bar{x}_2$ are the sample means. $\mu_1$ and $\mu_2$ are the population means.


The degrees of freedom (df) is a somewhat complicated calculation.
However, a computer or calculator calculates it easily. The dfs are not
always a whole number. The test statistic calculated above is approximated
by the student's-t distribution with dfs as follows:

Equation:
Degrees of freedom

$$df = \frac{\left(\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{(s_1)^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{(s_2)^2}{n_2}\right)^2}$$

When both sample sizes $n_1$ and $n_2$ are five or larger, the student's-t
approximation is very good. Notice that the sample variances $(s_1)^2$ and
$(s_2)^2$ are not pooled. (If the question comes up, do not pool the variances.)


Note: It is not necessary to compute this by hand. A calculator or computer 
easily computes it. 


Example: 

Independent groups 

The average amount of time boys and girls ages 7 through 11 spend 
playing sports each day is believed to be the same. An experiment is done, 
data is collected, resulting in the table below. Both populations have a 
normal distribution. 


         Sample    Average Number of Hours     Sample Standard
         Size      Playing Sports Per Day      Deviation
Girls    9         2 hours                     $\sqrt{0.75}$
Boys     16        3.2 hours                   1.00
Exercise: 
Problem: 


Is there a difference in the mean amount of time boys and girls ages 7 
through 11 play sports each day? Test at the 5% level of significance. 


Solution: 


The population standard deviations are not known. Let g be the
subscript for girls and b be the subscript for boys. Then, $\mu_g$ is the
population mean for girls and $\mu_b$ is the population mean for boys.
This is a test of two independent groups, two population means.


Random variable: $\bar{X}_g - \bar{X}_b$ = difference in the sample mean amount
of time girls and boys play sports each day.


$H_0: \mu_g = \mu_b$    ($\mu_g - \mu_b = 0$)
$H_a: \mu_g \ne \mu_b$    ($\mu_g - \mu_b \ne 0$)

The words "the same" tell you $H_0$ has an "=". Since there are no
other words to indicate $H_a$, then assume "is different." This is a
two-tailed test.


Distribution for the test: Use $t_{df}$ where df is calculated using the df
formula for independent groups, two population means. Using a
calculator, df is approximately 18.8462. Do not pool the variances.


Calculate the p-value using a student's-t distribution: p-value = 
0.0054 


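Graph: [Student's-t curve for $\bar{X}_g - \bar{X}_b$ centered at 0 with both tails
shaded; each shaded tail has area $\frac{1}{2}$(p-value) = 0.0028.]

$\bar{x}_g - \bar{x}_b = 2 - 3.2 = -1.2$

Half the p-value is below -1.2 and half is above 1.2.

Make a decision: Since $\alpha$ > p-value, reject $H_0$.

This means you reject $\mu_g = \mu_b$. The means are different.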


Conclusion: At the 5% level of significance, the sample data show 
there is sufficient evidence to conclude that the mean number of hours 
that girls and boys aged 7 through 11 play sports per day is different 
(mean number of hours boys aged 7 through 11 play sports per day is 
greater than the mean number of hours played by girls OR the mean 
number of hours girls aged 7 through 11 play sports per day is greater 
than the mean number of hours played by boys). 


Note: TI-83+ and TI-84: Press STAT. Arrow over to TESTS and press
4:2-SampTTest. Arrow over to Stats and press ENTER. Arrow down and enter
2 for the first sample mean, $\sqrt{0.75}$ for Sx1, 9 for n1, 3.2 for the second
sample mean, 1 for Sx2, and 16 for n2. Arrow down to $\mu_1$: and arrow to
does not equal $\mu_2$. Press ENTER. Arrow down to Pooled: and No. Press
ENTER. Arrow down to Calculate and press ENTER. The p-value is
p = 0.0054, the dfs are approximately 18.8462, and the test statistic is
-3.14. Do the procedure again but instead of Calculate do Draw.
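
For readers without a TI calculator, the girls-and-boys example can be checked with a short Python sketch. This is only an illustration under stated assumptions: the helper names `welch_from_stats` and `t_upper_tail` are hypothetical (they are not part of the text or of any library), and the tail area is approximated by Simpson's rule on the Student's-t density rather than by a statistics package.

```python
import math

def welch_from_stats(mean1, s1, n1, mean2, s2, n2):
    """Return (t, df) for the unpooled two-sample t-test from summary statistics."""
    v1 = s1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = s2 ** 2 / n2  # estimated variance of the second sample mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)  # null difference mu1 - mu2 = 0
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > |t|) for Student's t using Simpson's rule (steps must be even)."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    b = abs(t)
    h = b / steps
    total = pdf(0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3  # half the area lies above 0

# Summary statistics from the table: girls first, boys second.
t, df = welch_from_stats(2, math.sqrt(0.75), 9, 3.2, 1.0, 16)
p_value = 2 * t_upper_tail(t, df)  # two-tailed test
print(round(t, 2), round(df, 4), round(p_value, 4))
```

The printed values should land close to the test statistic, degrees of freedom, and p-value quoted in the calculator note. The variances are kept separate in `welch_from_stats`, which mirrors the "do not pool" instruction.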


Example: 

A study is done by a community group in two neighboring colleges to 
determine which one graduates students with more math classes. College A 
samples 11 graduates. Their average is 4 math classes with a standard 
deviation of 1.5 math classes. College B samples 9 graduates. Their 
average is 3.5 math classes with a standard deviation of 1 math class. The 
community group believes that a student who graduates from college A 
has taken more math classes, on the average. Both populations have a 
normal distribution. Test at a 1% significance level. Answer the following 
questions. 

Exercise: 


Problem:Is this a test of two means or two proportions? 
Solution: 


two means 


Exercise: 


Problem: 


Are the population standard deviations known or unknown?


Solution: 


unknown 


Exercise: 


Problem: Which distribution do you use to perform the test? 


Solution: 


student's-t 


Exercise: 


Problem: What is the random variable? 
Solution: 
$\bar{X}_A - \bar{X}_B$
Exercise:
Problem: What are the null and alternate hypotheses?
Solution:


• $H_0: \mu_A \le \mu_B$
• $H_a: \mu_A > \mu_B$


Exercise: 


Problem:Is this test right, left, or two tailed? 


Solution: 


right 


Exercise: 


Problem: What is the p-value? 


Solution: 


0.1928 


Exercise: 


Problem:Do you reject or not reject the null hypothesis? 


Solution: 


Do not reject. 


Conclusion: 

At the 1% level of significance, from the sample data, there is not 
sufficient evidence to conclude that a student who graduates from college 
A has taken more math classes, on the average, than a student who 
graduates from college B. 
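
The p-value above is stated without the intermediate arithmetic. As a sketch of that step, assuming the unpooled t-score and degrees of freedom formulas for independent groups with unknown population standard deviations, and taking $\mu_A - \mu_B = 0$ at the boundary of the null hypothesis:

$$t = \frac{(4 - 3.5) - 0}{\sqrt{\frac{(1.5)^2}{11} + \frac{(1)^2}{9}}} \approx \frac{0.5}{0.5618} \approx 0.89$$

$$df = \frac{\left(\frac{(1.5)^2}{11} + \frac{(1)^2}{9}\right)^2}{\frac{1}{10}\left(\frac{(1.5)^2}{11}\right)^2 + \frac{1}{8}\left(\frac{(1)^2}{9}\right)^2} \approx 17.4$$

The area to the right of 0.89 under a student's-t curve with about 17.4 degrees of freedom is the right-tailed p-value reported in the solution.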


Glossary 


Degrees of Freedom (df) 
The number of objects in a sample that are free to vary. 


Standard Deviation 
A number that is equal to the square root of the variance and measures 
how far data values are from their mean. Notation: s for sample 
standard deviation and $\sigma$ for population standard deviation.


Variable (Random Variable) 
A characteristic of interest in a population being studied. Common 
notation for variables are upper case Latin letters X, Y, Z,...; common 
notation for a specific value from the domain (set of all possible values 
of a variable) are lower case Latin letters x, y, z,.... For example, if X 
is the number of children in a family, then x represents a specific 
integer 0, 1, 2, 3, .... Variables in statistics differ from variables in 
intermediate algebra in the following two ways.


• The domain of the random variable (RV) is not necessarily a
numerical set; the domain may be expressed in words; for
example, if X = hair color then the domain is {black, blond, gray,
green, orange}.

• We can tell what specific value x of the Random Variable X takes
only after performing the experiment.


Comparing Two Independent Population Means with Known Population 
Standard Deviations 

This module provides an overview of hypothesis testing in situations where 
there are both two independent population means and known population 
standard deviations in statistics. 


Even though this situation is not likely (knowing the population standard 
deviations is not likely), the following example illustrates hypothesis testing 
for independent means, known population standard deviations. The 
sampling distribution for the difference between the means is normal and 
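both populations must be normal. The random variable is $\bar{X}_1 - \bar{X}_2$. The
normal distribution has the following format:


Equation:
Normal distribution

$$\bar{X}_1 - \bar{X}_2 \sim N\left[\mu_1 - \mu_2, \sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}}\right]$$

Equation:
The standard deviation is:

$$\sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}}$$

Equation:
The test statistic (z-score) is:

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}}}$$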


Example: 


independent groups, population standard deviations known: The mean 
lasting time of 2 competing floor waxes is to be compared. Twenty floors 
are randomly assigned to test each wax. Both populations have a normal 
distribution. The following table is the result. 


Wax    Sample Mean Number of        Population Standard
       Months Floor Wax Lasts       Deviation
1      3                            0.33
2      2.9                          0.36
Exercise: 
Problem: 


Does the data indicate that wax 1 is more effective than wax 2? Test 
at a 5% level of significance. 


Solution: 


This is a test of two independent groups, two population means, 
population standard deviations known. 


Random Variable: $\bar{X}_1 - \bar{X}_2$ = difference in the mean number of
months the competing floor waxes last.


$H_0: \mu_1 \le \mu_2$

$H_a: \mu_1 > \mu_2$


The words "is more effective" say that wax 1 lasts longer than
wax 2, on the average. "Longer" is a ">" symbol and goes into $H_a$.
Therefore, this is a right-tailed test.


Distribution for the test: The population standard deviations are
known so the distribution is normal. Using the formula above, the
distribution is:

$$\bar{X}_1 - \bar{X}_2 \sim N\left[0, \sqrt{\frac{(0.33)^2}{20} + \frac{(0.36)^2}{20}}\right]$$

Since $\mu_1 \le \mu_2$, then $\mu_1 - \mu_2 \le 0$ and the mean for the normal
distribution is 0.


Calculate the p-value using the normal distribution: p-value =
0.1799


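Graph: [Normal curve for $\bar{X}_1 - \bar{X}_2$ centered at 0 with the area to the
right of 0.1 shaded; p-value = 0.1799.]

From $H_0$: $\mu_1 - \mu_2 \le 0$

$\bar{x}_1 - \bar{x}_2 = 3 - 2.9 = 0.1$


Compare $\alpha$ and the p-value: $\alpha$ = 0.05 and p-value = 0.1799.
Therefore, $\alpha$ < p-value.


Make a decision: Since $\alpha$ < p-value, do not reject $H_0$.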


Conclusion: At the 5% level of significance, from the sample data, 
there is not sufficient evidence to conclude that the mean time wax 1 
lasts is longer (wax 1 is more effective) than the mean time wax 2 
lasts. 


Note: TI-83+ and TI-84: Press STAT. Arrow over to TESTS and press
3:2-SampZTest. Arrow over to Stats and press ENTER. Arrow down and enter
.33 for sigma1, .36 for sigma2, 3 for the first sample mean, 20 for n1,
2.9 for the second sample mean, and 20 for n2. Arrow down to $\mu_1$: and
arrow to > $\mu_2$. Press ENTER. Arrow down to Calculate and press ENTER.
The p-value is p = 0.1799 and the test statistic is 0.9157. Do the
procedure again but instead of Calculate do Draw.

Comparing Two Independent Population Proportions 


1. The two independent samples are simple random samples that are 
independent. 

2. The number of successes is at least five and the number of failures is at 
least five for each of the samples. 


Comparing two proportions, like comparing two means, is common. If two
estimated proportions are different, it may be due to a difference in the
populations or it may be due to chance. A hypothesis test can help
determine if a difference in the estimated proportions $(P'_A - P'_B)$
reflects a difference in the population proportions.


The difference of two proportions follows an approximate normal
distribution. Generally, the null hypothesis states that the two proportions
are the same. That is, $H_0: p_A = p_B$. To conduct the test, we use a pooled
proportion, $p_c$.

Equation:
The pooled proportion is calculated as follows:

$$p_c = \frac{x_A + x_B}{n_A + n_B}$$

Equation:
The distribution for the differences is:

$$P'_A - P'_B \sim N\left[0, \sqrt{p_c (1 - p_c) \left(\frac{1}{n_A} + \frac{1}{n_B}\right)}\right]$$

Equation:
The test statistic (z-score) is:

$$z = \frac{(p'_A - p'_B) - (p_A - p_B)}{\sqrt{p_c (1 - p_c) \left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$$


Example: 

Two population proportions 

Two types of medication for hives are being tested to determine if there is a 
difference in the proportions of adult patient reactions. Twenty out of a 
random sample of 200 adults given medication A still had hives 30 
minutes after taking the medication. Twelve out of another random 
sample of 200 adults given medication B still had hives 30 minutes after 
taking the medication. Test at a 1% level of significance. 


Determining the solution 


This is a test of 2 population proportions. 
Exercise: 


Problem: How do you know? 


Solution: 


The problem asks for a difference in proportions. 


Let A and B be the subscripts for medication A and
medication B. Then $p_A$ and $p_B$ are the desired population
proportions.


Random Variable:

$P'_A - P'_B$ = difference in the proportions of adult patients
who did not react after 30 minutes to medication A and
medication B.


$H_0: p_A = p_B$    ($p_A - p_B = 0$)
$H_a: p_A \ne p_B$    ($p_A - p_B \ne 0$)
The words "is a difference" tell you the test is two-tailed.


Distribution for the test: Since this is a test of two
binomial population proportions, the distribution is normal:

$$p_c = \frac{x_A + x_B}{n_A + n_B} = \frac{20 + 12}{200 + 200} = 0.08 \qquad 1 - p_c = 0.92$$

Therefore,

$$P'_A - P'_B \sim N\left[0, \sqrt{(0.08)(0.92)\left(\frac{1}{200} + \frac{1}{200}\right)}\right]$$

$P'_A - P'_B$ follows an approximate normal distribution.


Calculate the p-value using the normal distribution:
p-value = 0.1404.


Estimated proportion for group A: $p'_A = \frac{x_A}{n_A} = \frac{20}{200} = 0.1$


Estimated proportion for group B: $p'_B = \frac{x_B}{n_B} = \frac{12}{200} = 0.06$


Graph: [Normal curve for $P'_A - P'_B$ centered at 0 with both tails
shaded, below -0.04 and above 0.04; each shaded tail has area
$\frac{1}{2}$(p-value) = 0.0702.]

From $H_0$, $p_A - p_B = 0$.

$p'_A - p'_B = 0.1 - 0.06 = 0.04$


Half the p-value is below -0.04 and half is above 0.04.


Compare $\alpha$ and the p-value: $\alpha$ = 0.01 and the
p-value = 0.1404. $\alpha$ < p-value.


Make a decision: Since $\alpha$ < p-value, do not reject $H_0$.


Conclusion: At a 1% level of significance, from the sample 
data, there is not sufficient evidence to conclude that there is 
a difference in the proportions of adult patients who did not 

react after 30 minutes to medication A and medication B. 


Note: TI-83+ and TI-84: Press STAT. Arrow over to TESTS and press
6:2-PropZTest. Arrow down and enter 20 for x1, 200 for n1, 12 for x2,
and 200 for n2. Arrow down to p1: and arrow to not equal p2. Press
ENTER. Arrow down to Calculate and press ENTER. The p-value is
p = 0.1404 and the test statistic is 1.47. Do the procedure again but
instead of Calculate do Draw.

Matched or Paired Samples 
This module provides an overview of Hypothesis Testing: Matched or Paired Samples as a part of 
Collaborative Statistics collection (col10522) by Barbara Illowsky and Susan Dean. 


1. Simple random sampling is used.

2. Sample sizes are often small.

3. Two measurements (samples) are drawn from the same pair of individuals or objects.

4. Differences are calculated from the matched or paired samples.

5. The differences form the sample that is used for the hypothesis test.

6. The matched pairs have differences that either come from a population that is normal or the number of
differences is sufficiently large so the distribution of the sample mean of differences is approximately
normal.


In a hypothesis test for matched or paired samples, subjects are matched in pairs and differences are calculated.
The differences are the data. The population mean for the differences, $\mu_d$, is then tested using a Student-t test
for a single population mean with $n - 1$ degrees of freedom where n is the number of differences.
Equation:

The test statistic (t-score) is:

$$t = \frac{\bar{x}_d - \mu_d}{\left(\frac{s_d}{\sqrt{n}}\right)}$$


Example: 

Matched or paired samples 

A study was conducted to investigate the effectiveness of hypnotism in reducing pain. Results for randomly 
selected subjects are shown in the table. The "before" value is matched to an "after" value and the differences 
are calculated. The differences have a normal distribution. 


Subject: A B C D E F G H 

Before 6.6 6.5 9.0 10.3 11.3 8.1 6.3 11.6 

After 6.8 2.4 7.4 8.5 8.1 6.1 3.4 2.0 
Exercise: 

Problem: 


Are the sensory measurements, on average, lower after hypnotism? Test at a 5% significance level. 


Solution: 


Corresponding "before" and "after" values form matched pairs. (Calculate "after" - "before").


After Data Before Data Difference 


6.8 6.6 0.2 
2.4 6.5 -4.1 
7.4 9 -1.6 
8.5 10.3 -1.8 
8.1 11.3 -3.2 
6.1 8.1 -2 

3.4 6.3 -2.9 
2 11.6 -9.6 


The data for the test are the differences: {0.2, -4.1, -1.6, -1.8, -3.2, -2, -2.9, -9.6} 


The sample mean and sample standard deviation of the differences are: $\bar{x}_d = -3.13$ and $s_d = 2.91$.
Verify these values.


Let $\mu_d$ be the population mean for the differences. We use the subscript d to denote "differences."


Random Variable: $\bar{X}_d$ = the mean difference of the sensory measurements
Equation:


$H_0: \mu_d \ge 0$
There is no improvement. ($\mu_d$ is the population mean of the differences.)
Equation:

$H_a: \mu_d < 0$


There is improvement. The score should be lower after hypnotism so the difference ought to be negative 
to indicate improvement. 


Distribution for the test: The distribution is a student-t with $df = n - 1 = 8 - 1 = 7$. Use $t_7$. (Notice
that the test is for a single population mean.)


Calculate the p-value using the Student-t distribution: p-value = 0.0095 


Graph: [Student-t curve for $\bar{X}_d$ centered at 0 with the area to the left of -3.13 shaded; p-value = 0.0095.]

From $H_0$, $\mu_d \ge 0$

$\bar{X}_d$ is the random variable for the differences.

The sample mean and sample standard deviation of the differences are:

$\bar{x}_d = -3.13$

$s_d = 2.91$

Compare $\alpha$ and the p-value: $\alpha$ = 0.05 and p-value = 0.0095. $\alpha$ > p-value.

Make a decision: Since $\alpha$ > p-value, reject $H_0$.

This means that $\mu_d < 0$ and there is improvement.

Conclusion: At a 5% level of significance, from the sample data, there is sufficient evidence to conclude 


that the sensory measurements, on average, are lower after hypnotism. Hypnotism appears to be effective 
in reducing pain. 


Note:For the TI-83+ and TI-84 calculators, you can either calculate the differences ahead of time (after 
- before) and put the differences into a list or you can put the after data into a first list and the before 
data into a second list. Then go to a third list and arrow up to the name. Enter 1st list name - 2nd list 
name. The calculator will do the subtraction and you will have the differences in the third list. 


Note: TI-83+ and TI-84: Use your list of differences as the data. Press STAT and arrow over to TESTS. Press
2:T-Test. Arrow over to Data and press ENTER. Arrow down and enter 0 for $\mu_0$, the name of the list where
you put the data, and 1 for Freq:. Arrow down to $\mu$: and arrow over to < $\mu_0$. Press ENTER. Arrow down to
Calculate and press ENTER. The p-value is 0.0094 and the test statistic is -3.04. Do these instructions again
except arrow to Draw (instead of Calculate). Press ENTER.
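
The "verify these values" step in the hypnotism example can be carried out with a short Python sketch. It assumes only the standard library and the before/after data from the table; the variable names are illustrative. It stops at the test statistic and degrees of freedom, since the p-value additionally needs a Student-t cumulative distribution from a calculator or statistics package.

```python
import math
from statistics import mean, stdev

before = [6.6, 6.5, 9.0, 10.3, 11.3, 8.1, 6.3, 11.6]
after = [6.8, 2.4, 7.4, 8.5, 8.1, 6.1, 3.4, 2.0]

# Matched pairs: one difference (after - before) per subject.
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
x_bar_d = mean(diffs)   # sample mean of the differences
s_d = stdev(diffs)      # sample standard deviation (divides by n - 1)
t = (x_bar_d - 0) / (s_d / math.sqrt(n))  # hypothesized mean difference is 0
df = n - 1

print(round(x_bar_d, 2), round(s_d, 2), round(t, 2), df)
```

Once the differences are formed, the rest is an ordinary one-sample t-test, which is why the calculator instructions use T-Test rather than a two-sample test.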


Example: 

A college football coach was interested in whether the college's strength development class increased his 
players' maximum lift (in pounds) on the bench press exercise. He asked 4 of his players to participate in a 
study. The amount of weight they could each lift was recorded before they took the strength development 
class. After completing the class, the amount of weight they could each lift was again measured. The data are 
as follows: 


Weight (in pounds)                            Player 1   Player 2   Player 3   Player 4
Amount of weight lifted prior to the class    205        241        338        368
Amount of weight lifted after the class       295        252        330        360


The coach wants to know if the strength development class makes his players stronger, on average. 
Exercise: 


Problem: 

Record the differences data. Calculate the differences by subtracting the amount of weight lifted prior to 
the class from the weight lifted after completing the class. The data for the differences are: {90, 11, -8, 
-8}. The differences have a normal distribution. 

Using the differences data, calculate the sample mean and the sample standard deviation. 

$\bar{x}_d = 21.3$    $s_d = 46.7$

Using the difference data, this becomes a test of a single (fill in the blank).

Define the random variable: $\bar{X}_d$ = mean difference in the maximum lift per player.

The distribution for the hypothesis test is $t_3$.

$H_0: \mu_d \le 0$    $H_a: \mu_d > 0$

Graph: [Student-t curve centered at 0 with the area to the right of 21.3 shaded; p-value = 0.2150.]


Calculate the p-value: The p-value is 0.2150


Decision: If the level of significance is 5%, the decision is to not reject the null hypothesis because
$\alpha$ < p-value.


What is the conclusion? 
Solution: 


means; At a 5% level of significance, from the sample data, there is not sufficient evidence to conclude 
that the strength development class helped to make the players stronger, on average. 
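
The test statistic behind this p-value is not shown in the example. As a sketch, using the matched-pairs t-score with the unrounded sample mean of the differences (85/4 = 21.25), $s_d = 46.7$, and $n = 4$:

$$t = \frac{\bar{x}_d - \mu_d}{\left(\frac{s_d}{\sqrt{n}}\right)} = \frac{21.25 - 0}{\left(\frac{46.7}{\sqrt{4}}\right)} \approx 0.91$$

The area to the right of 0.91 under the $t_3$ curve is the p-value of 0.2150 reported above.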


Example: 


Seven eighth graders at Kennedy Middle School measured how far they could push the shot-put with their 
dominant (writing) hand and their weaker (non-writing) hand. They thought that they could push equal 
distances with either hand. The following data was collected. 


Distance (in feet)   Student   Student   Student   Student   Student   Student   Student
using                1         2         3         4         5         6         7

Dominant hand        30        26        34        17        19        26        20

Weaker hand          28        14        27        18        17        26        16


Exercise: 


Problem: 


Conduct a hypothesis test to determine whether the mean difference in distances between the children's 
dominant versus weaker hands is significant. 


Note: Use a t-test on the difference data. Assume the differences have a normal distribution. The random
variable is the mean difference.


Note: The test statistic is 2.18 and the p-value is 0.0716.


What is your conclusion? 


Solution: 


$H_0$: $\mu_d$ equals 0; $H_a$: $\mu_d$ does not equal 0; Do not reject the null; At a 5% significance level, from the
sample data, there is not sufficient evidence to conclude that the mean difference in distances between the 
children's dominant versus weaker hands is significant (there is not sufficient evidence to show that the 
children could push the shot-put further with their dominant hand). Alpha and the p-value are close so the 
test is not strong. 


Summary of Types of Hypothesis Tests 
Two Population Means 


• Populations are independent and population standard deviations are
unknown.

• Populations are independent and population standard deviations are
known (not likely).


Matched or Paired Samples 


• Two samples are drawn from the same set of objects.
• Samples are dependent.


Two Population Proportions 


• Populations are independent.


Practice 1: Hypothesis Testing for Two Proportions 

This module provides a practice of Two Population Means and Two 
Population Proportions as a part of Collaborative Statistics collection 
(col10522) by Barbara Illowsky and Susan Dean. 


Student Learning Outcomes 


e The student will conduct a hypothesis test of two proportions. 


Given 


In the recent Census, 3 percent of the U.S. population reported being two or 
more races. However, the percent varies tremendously from state to state. 
(Source: http://www.census.gov/prod/cen2010/briefs/c2010br-02.pdf) 
Suppose that two random surveys are conducted. In the first random survey, 
out of 1000 North Dakotans, only 9 people reported being of two or more 
races. In the second random survey, out of 500 Nevadans, 17 people 
reported being of two or more races. Conduct a hypothesis test to determine 
if the population percents are the same for the two states or if the percent 
for Nevada is statistically higher than for North Dakota. 


Hypothesis Testing: Two Proportions 


Exercise: 


Problem: Is this a test of means or proportions? 
Solution: 


Proportions 


Exercise: 


Problem: State the null and alternative hypotheses. 


• a. $H_0$:


• b. $H_a$:


Solution:


• a. $H_0: p_N = p_{ND}$


• b. $H_a: p_N > p_{ND}$


Exercise: 


Problem: 

Is this a right-tailed, left-tailed, or two-tailed test? How do you know? 
Solution: 

right-tailed 


Exercise: 


Problem: What is the Random Variable of interest for this test? 


Exercise: 


Problem: In words, define the Random Variable for this test. 
Exercise: 


Problem: 


Which distribution (Normal or student's-t) would you use for this 
hypothesis test? 


Solution: 


Normal 


Exercise: 


Problem: 
Explain why you chose the distribution you did for the above question. 
Exercise: 


Problem: Calculate the test statistic. 


Solution: 


3.50 
Exercise: 


Problem: 


Sketch a graph of the situation. Mark the hypothesized difference and 
the sample difference. Shade the area corresponding to the p-value.


Exercise: 


Problem: Find the p-value:


Solution: 


0.0002 


Exercise: 


Problem: At a pre-conceived $\alpha$ = 0.05, what is your:


• a. Decision:
• b. Reason for the decision:
• c. Conclusion (write out in a complete sentence):


Solution:


• a. Reject the null hypothesis


Discussion Question 


Exercise: 
Problem: 
Does it appear that the proportion of Nevadans who are two or more 


races is higher than the proportion of North Dakotans? Why or why 
not? 
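
As a check on the test statistic given in the solutions, here is a sketch of the arithmetic, assuming the pooled-proportion z-score for two proportions with N for Nevada and ND for North Dakota:

$$p'_N = \frac{17}{500} = 0.034 \qquad p'_{ND} = \frac{9}{1000} = 0.009 \qquad p_c = \frac{17 + 9}{500 + 1000} \approx 0.0173$$

$$z = \frac{(0.034 - 0.009) - 0}{\sqrt{(0.0173)(0.9827)\left(\frac{1}{500} + \frac{1}{1000}\right)}} \approx \frac{0.025}{0.00715} \approx 3.50$$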


Practice 2: Hypothesis Testing for Two Averages 

This module provides a practice of Hypothesis Testing: Two Population 
Means and Two Population Proportions: as a part of Collaborative Statistics 
collection (col10522) by Barbara Illowsky and Susan Dean. 


Student Learning Outcome 


e The student will conduct a hypothesis test of two means. 


Given 


The U.S. Center for Disease Control reports that the mean life expectancy 
for whites born in 1900 was 47.6 years and for nonwhites it was 33.0 years. 
(http://www.cdc.gov/nchs/data/dvs/nvsr53_06t12.pdf ) Suppose that you 
randomly survey death records for people born in 1900 in a certain county. 
Of the 124 whites, the mean life span was 45.3 years with a standard 
deviation of 12.7 years. Of the 82 nonwhites, the mean life span was 34.1 
years with a standard deviation of 15.6 years. Conduct a hypothesis test to 
see if the mean life spans in the county were the same for whites and 
nonwhites. 


Hypothesis Testing: Two Means 


Exercise: 


Problem: Is this a test of means or proportions? 


Solution: 
Means 
Exercise: 
Problem: State the null and alternative hypotheses. 


• a. $H_0$:


• b. $H_a$:


Solution:
• a. $H_0: \mu_W = \mu_{NW}$
• b. $H_a: \mu_W \ne \mu_{NW}$

Exercise: 

Problem: 


Is this a right-tailed, left-tailed, or two-tailed test? How do you know? 
Solution: 
two-tailed 
Exercise: 
Problem: What is the Random Variable of interest for this test? 


Solution: 


$\bar{X}_W - \bar{X}_{NW}$
Exercise: 


Problem: 
In words, define the Random Variable of interest for this test. 
Solution: 


The difference between the mean life spans of whites and nonwhites. 


Exercise: 


Problem: 


Which distribution (Normal or student's-t) would you use for this 
hypothesis test? 


Exercise: 


Problem: 

Explain why you chose the distribution you did for the above question. 
Exercise: 

Problem: Calculate the test statistic. 


Solution: 


5.42 
Exercise: 
Problem: 
Sketch a graph of the situation. Label the horizontal axis. Mark the 


hypothesized difference and the sample difference. Shade the area 
corresponding to the p-value. 




Exercise: 


Problem: Find the p-value:


Solution: 


0.0000 
Exercise: 
Problem: At a pre-conceived $\alpha$, what is your:


• a. Decision:
• b. Reason for the decision:
• c. Conclusion (write out in a complete sentence):


Solution:


• a. Reject the null hypothesis


Discussion Question 


Exercise: 


Problem: 


Does it appear that the means are the same? Why or why not? 
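
As a check on the test statistic given in the solutions, here is a sketch of the arithmetic, assuming the unpooled t-score for two independent means with W for whites and NW for nonwhites:

$$t = \frac{(45.3 - 34.1) - 0}{\sqrt{\frac{(12.7)^2}{124} + \frac{(15.6)^2}{82}}} \approx \frac{11.2}{\sqrt{1.3007 + 2.9678}} \approx \frac{11.2}{2.066} \approx 5.42$$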


Homework 


For questions [link] - [link], indicate which of the following choices best identifies 
the hypothesis test. 


• A. Independent group means, population standard deviations and/or variances
known

• B. Independent group means, population standard deviations and/or variances
unknown

• C. Matched or paired samples

• D. Single mean

• E. 2 proportions

• F. Single proportion


Exercise: 
Problem: 
A powder diet is tested on 49 people and a liquid diet is tested on 36 different 
people. The population standard deviations are 2 pounds and 3 pounds, 


respectively. Of interest is whether the liquid diet yields a higher mean weight 
loss than the powder diet. 


Solution: 


A 
Exercise: 
Problem: 
A new chocolate bar is taste-tested on consumers. Of interest is whether the 


proportion of children that like the new chocolate bar is greater than the 
proportion of adults that like it. 


Exercise: 
Problem: 
The mean number of English courses taken in a two-year time period by male 


and female college students is believed to be about the same. An experiment is 
conducted and data are collected from 9 males and 16 females. 


Solution: 


B 
Exercise: 
Problem: 
A football league reported that the mean number of touchdowns per game was 


5. A study is done to determine if the mean number of touchdowns has 
decreased. 


Exercise: 
Problem: 
A study is done to determine if students in the California state university 
system take longer to graduate than students enrolled in private universities. 
100 students from both the California state university system and private 


universities are surveyed. From years of research, it is known that the 
population standard deviations are 1.5811 years and 1 year, respectively. 


Solution: 


A 
Exercise: 


Problem: 


According to a YWCA Rape Crisis Center newsletter, 75% of rape victims 
know their attackers. A study is done to verify this. 


Exercise: 


Problem: 


According to a recent study, U.S. companies have a mean maternity-leave of
six weeks. 


Solution: 


D 


Exercise: 


Problem: 


A recent drug survey showed an increase in use of drugs and alcohol among 
local high school students as compared to the national percent. Suppose that a 
survey of 100 local youths and 100 national youths is conducted to see if the 
proportion of drug and alcohol use is higher locally than nationally. 


Exercise: 
Problem: 


A new SAT study course is tested on 12 individuals. Pre-course and post-course
scores are recorded. Of interest is the mean increase in SAT scores.


Solution: 


C 
Exercise: 


Problem: 


University of Michigan researchers reported in the Journal of the National 
Cancer Institute that quitting smoking is especially beneficial for those under 
age 49. In this American Cancer Society study, the risk (probability) of dying 
of lung cancer was about the same as for those who had never smoked. 


Note: For each of the word problems, use a solution sheet to do the hypothesis test.
The solution sheet is found in 14. Appendix (online book version: the link is 
"Solution Sheets"; PDF book version: look under 14.5 Solution Sheets). Please feel 
free to make copies of the solution sheets. For the online version of the book, it is 
suggested that you copy the .doc or the .pdf files. 


Note: If you are using a student's-t distribution for a homework problem below,
including for paired data, you may assume that the underlying population is 
normally distributed. (In general, you must first prove that assumption, though.) 


Exercise: 


Problem: 


A powder diet is tested on 49 people and a liquid diet is tested on 36 different 
people. Of interest is whether the liquid diet yields a higher mean weight loss 
than the powder diet. The powder diet group had a mean weight loss of 42
pounds with a standard deviation of 12 pounds. The liquid diet group had an 
mean weight loss of 45 pounds with a standard deviation of 14 pounds. 


Solution: 


• d. $t_{68.44}$

• e. -1.04

• f. 0.1519

• h. Decision: Do not reject null
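
As a rough check on the values in this solution, the sketch below recomputes the test statistic and the degrees of freedom from the summary statistics in the problem. It assumes the unequal-variance form of the two-sample t test; the function name `welch_t_from_summary` is a hypothetical helper, not something from the text. The p-value would also need a student's-t cumulative distribution, which Python's standard library does not supply, so it is left out here.

```python
import math


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for two independent samples with unknown, unequal variances."""
    v1 = sd1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = sd2 ** 2 / n2  # estimated variance of the second sample mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Approximate degrees of freedom for the unequal-variance t test
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Powder diet: mean 42, s = 12, n = 49; liquid diet: mean 45, s = 14, n = 36
t_stat, df = welch_t_from_summary(42, 12, 49, 45, 14, 36)
print(round(t_stat, 2), round(df, 2))  # about -1.04 and 68.44
```

The sign of the statistic depends on the order of subtraction; here it is powder minus liquid, which matches the negative value in the solution.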


Exercise: 


Problem: 


The mean number of English courses taken in a two-year time period by male 
and female college students is believed to be about the same. An experiment is 
conducted and data are collected from 29 males and 16 females. The males 
took an average of 3 English courses with a standard deviation of 0.8. The 
females took an average of 4 English courses with a standard deviation of 1.0. 
Are the means statistically the same? 


Exercise: 


Problem: 


A study is done to determine if students in the California state university 
system take longer to graduate, on average, than students enrolled in private 
universities. 100 students from both the California state university system and 
private universities are surveyed. Suppose that from years of research, it is 
known that the population standard deviations are 1.5811 years and 1 year, 
respectively. The following data are collected. The California state university 
system students took on average 4.5 years with a standard deviation of 0.8. 
The private university students took on average 4.1 years with a standard 
deviation of 0.3. 


Solution: 


Standard Normal 


• e. z = 2.14
• f. 0.0163
• h. Decision: Reject null when α = 0.05; Do not reject null when α = 0.01
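
When the population standard deviations are known, the test statistic is a z-score and the p-value comes from the standard normal distribution. The sketch below is one way to reproduce the numbers in this solution; `two_mean_z_test` is a hypothetical helper name, and a right-tailed alternative (state university students take longer) is assumed.

```python
import math
from statistics import NormalDist


def two_mean_z_test(mean1, sigma1, n1, mean2, sigma2, n2):
    """Return (z, right-tail p-value) for two independent means with known sigmas."""
    se = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)  # standard error of the difference
    z = (mean1 - mean2) / se
    p_right = 1 - NormalDist().cdf(z)
    return z, p_right


# State university system: mean 4.5, sigma 1.5811, n = 100
# Private universities:    mean 4.1, sigma 1,      n = 100
z_stat, p_value = two_mean_z_test(4.5, 1.5811, 100, 4.1, 1.0, 100)
print(round(z_stat, 2), round(p_value, 4))
```

Note that the sample standard deviations (0.8 and 0.3) are not used, because the population values are given.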


Exercise: 
Problem: 
A new SAT study course is tested on 12 individuals. Pre-course and post-course
scores are recorded. Of interest is the mean increase in SAT scores. The
following data are collected:


Pre-course score    Post-course score
1200                1300
960                 920
1010                1100
840                 880
1100                1070
1250                1320
860                 860
1330                1370
790                 770
990                 1040
1110                1200
740                 850


Exercise: 


Problem: 


A recent drug survey showed an increase in use of drugs and alcohol among 
local high school seniors as compared to the national percent. Suppose that a 
survey of 100 local seniors and 100 national seniors is conducted to see if the 
proportion of drug and alcohol use is higher locally than nationally. Locally, 65 
seniors reported using drugs or alcohol within the past month, while 60 
national seniors reported using them. 


Solution: 


• e. 0.73
• f. 0.2326
• h. Decision: Do not reject null
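
For two proportions, the usual test statistic pools the two samples to estimate the common proportion under the null hypothesis. The sketch below recomputes the values in this solution under that assumption; `two_proportion_z_test` is a hypothetical helper name, and a right-tailed alternative (local use is higher) is assumed.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, right-tail p-value) using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 1 - NormalDist().cdf(z)


# 65 of 100 local seniors versus 60 of 100 national seniors
z_stat, p_value = two_proportion_z_test(65, 100, 60, 100)
print(round(z_stat, 2), round(p_value, 4))
```

For a left-tailed alternative, the p-value would be `NormalDist().cdf(z)` instead.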


Exercise: 


Problem: 


A student at a four-year college claims that mean enrollment at four-year 
colleges is higher than at two-year colleges in the United States. Two surveys 
are conducted. Of the 35 two-year colleges surveyed, the mean enrollment 
was 5068 with a standard deviation of 4777. Of the 35 four-year colleges 
surveyed, the mean enrollment was 5466 with a standard deviation of 8191. 
(Source: Microsoft Bookshelf) 


Exercise: 


Problem: 


A study was conducted by the U.S. Army to see if applying antiperspirant to 
soldiers’ feet for a few days before a major hike would help cut down on the 
number of blisters soldiers had on their feet. In the experiment, for three nights 
before they went on a 13-mile hike, a group of 328 West Point cadets put an 
alcohol-based antiperspirant on their feet. A “control group” of 339 soldiers 
put on a similar, but inactive, preparation on their feet. On the day of the hike, 
the temperature reached 83° F. At the end of the hike, 21% of the soldiers who 
had used the antiperspirant and 48% of the control group had developed foot 
blisters. Conduct a hypothesis test to see if the proportion of soldiers using the 
antiperspirant was significantly lower than the control group. (Source: U.S. 
Army study reported in Journal of the American Academy of Dermatologists) 


Solution: 


• e. -7.33
• f. 0
• h. Decision: Reject null


Exercise: 


Problem: 


We are interested in whether the proportions of female suicide victims for ages 
15 to 24 are the same for the white and the black races in the United States. We 
randomly pick one year, 1992, to compare the races. The number of suicides 
estimated in the United States in 1992 for white females is 4930. 580 were 
aged 15 to 24. The estimate for black females is 330. 40 were aged 15 to 24. 
We will let female suicide victims be our population. (Source: the National 
Center for Health Statistics, U.S. Dept. of Health and Human Services) 


Exercise: 


Problem: 


At Rachel’s 11th birthday party, 8 girls were timed to see how long (in 
seconds) they could hold their breath in a relaxed position. After a two-minute 
rest, they timed themselves while jumping. The girls thought that the mean 
difference between their jumping and relaxed times would be 0. Test their 
hypothesis. 


Relaxed time (seconds)    Jumping time (seconds)
26                        21
47                        40
30                        28
22                        21
23                        25
45                        43
37                        35
29                        32
Solution: 

• d. $t_7$

• e. -1.51

• f. 0.1755

• h. Decision: Do not reject null
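
A matched-pairs test reduces to a one-sample t test on the differences. The sketch below works through that reduction for the breath-holding data, taking differences as jumping minus relaxed. It assumes the last jumping time in the table reads 32, which is the value consistent with the test statistic in the solution. The two-tailed p-value needs a student's-t distribution and is not computed here.

```python
import math
from statistics import mean, stdev

relaxed = [26, 47, 30, 22, 23, 45, 37, 29]
jumping = [21, 40, 28, 21, 25, 43, 35, 32]  # last value assumed to be 32

# Work with one column of differences instead of two columns of raw times
diffs = [j - r for j, r in zip(jumping, relaxed)]
d_bar = mean(diffs)   # sample mean of the differences
s_d = stdev(diffs)    # sample standard deviation of the differences
n = len(diffs)

t_stat = (d_bar - 0) / (s_d / math.sqrt(n))  # null hypothesis: mean difference is 0
df = n - 1
print(d_bar, round(t_stat, 2), df)
```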


Exercise: 


Problem: 


Elizabeth Mjelde, an art history professor, was interested in whether the value
from the Golden Ratio formula, $\frac{\text{larger} + \text{smaller dimension}}{\text{larger dimension}}$, was the same in the
Whitney Exhibit for works from 1900 - 1919 as for works from 1920 - 1942.
37 early works were sampled. They averaged 1.74 with a standard deviation of
0.11. 65 of the later works were sampled. They averaged 1.746 with a standard
deviation of 0.1064. Do you think that there is a significant difference in the
Golden Ratio calculation? (Source: data from Whitney Exhibit on loan to San
Jose Museum of Art)


Exercise: 


Problem: 


One of the questions in a study of marital satisfaction of dual-career couples
was to rate the statement, “I’m pleased with the way we divide the 
responsibilities for childcare.” The ratings went from 1 (strongly agree) to 5 
(strongly disagree). Below are ten of the paired responses for husbands and 
wives. Conduct a hypothesis test to see if the mean difference in the husband’s 
versus the wife’s satisfaction level is negative (meaning that, within the 
partnership, the husband is happier than the wife). 


Wife’s 
score 


Husband’s 
score 


Solution: 


• d. $t_9$

• e. t = -1.86

• f. 0.0479

• h. Decision: Reject null, but run another test


Exercise: 
Problem: 
Ten individuals went on a low-fat diet for 12 weeks to lower their cholesterol.


Evaluate the data below. Do you think that their cholesterol levels were 
significantly lowered? 


Starting cholesterol level Ending cholesterol level 


140 140 
220 230 
110 120 
240 220 
200 190 
180 150 
190 200 
360 300 
280 300 
260 240 
Exercise: 
Problem: 


Mean entry level salaries for college graduates with mechanical engineering 
degrees and electrical engineering degrees are believed to be approximately 
the same. (Source: http://www.graduatingengineer.com). A recruiting office
thinks that the mean mechanical engineering salary is actually lower than the 
mean electrical engineering salary. The recruiting office randomly surveys 50 
entry level mechanical engineers and 60 entry level electrical engineers. Their 
mean salaries were $46,100 and $46,700, respectively. Their standard 
deviations were $3450 and $4210, respectively. Conduct a hypothesis test to 
determine if you agree that the mean entry level mechanical engineering salary 
is lower than the mean entry level electrical engineering salary. 


Solution: 


• d. $t_{108}$
• e. t = -0.82
• f. 0.2066
• h. Decision: Do not reject null


Exercise: 


Problem: 


A recent year was randomly picked from 1985 to the present. In that year, 
there were 2051 Hispanic students at Cabrillo College out of a total of 12,328 
students. At Lake Tahoe College, there were 321 Hispanic students out of a 
total of 2441 students. In general, do you think that the percent of Hispanic 
students at the two colleges is basically the same or different? (Source: 
Chancellor’s Office, California Community Colleges, November 1994) 


Exercise: 
Problem: 
Eight runners were convinced that the mean difference in their individual times 
for running one mile versus race walking one mile was at most 2 minutes. 


Below are their times. Do you agree that the mean difference is at most 2 
minutes? 


Running time (minutes)    Race walking time (minutes)
5.1                       7.3
5.6                       9.2
6.2                       10.4
4.8                       6.9
7.1                       8.9
4.2                       9.5
6.1                       9.4
4.4                       7.9


Solution: 


• d. $t_7$

• e. t = 2.9850

• f. 0.0102

• h. Decision: Reject null; there is sufficient evidence to conclude that the
mean difference is more than 2 minutes.


Exercise: 


Problem: 


Marketing companies have collected data implying that teenage girls use more 
ring tones on their cellular phones than teenage boys do. In one particular 
study of 40 randomly chosen teenage girls and boys (20 of each) with cellular 
phones, the mean number of ring tones for the girls was 3.2 with a standard 
deviation of 1.5. The mean for the boys was 1.7 with a standard deviation of 
0.8. Conduct a hypothesis test to determine if the means are approximately the 
same or if the girls’ mean is higher than the boys’ mean. 


Exercise: 


Problem: 


While her husband spent 2½ hours picking out new speakers, a statistician
decided to determine whether the percent of men who enjoy shopping for 
electronic equipment is higher than the percent of women who enjoy shopping 
for electronic equipment. The population was Saturday afternoon shoppers. 
Out of 67 men, 24 said they enjoyed the activity. 8 of the 24 women surveyed 
claimed to enjoy the activity. Interpret the results of the survey. 


Solution: 


• e. 0.22
• f. 0.4133
• h. Decision: Do not reject null


Exercise: 


Problem: 


We are interested in whether children’s educational computer software costs 
less, on average, than children’s entertainment software. 36 educational 
software titles were randomly picked from a catalog. The mean cost was 
$31.14 with a standard deviation of $4.69. 35 entertainment software titles 
were randomly picked from the same catalog. The mean cost was $33.86 with 
a standard deviation of $10.87. Decide whether children’s educational software 
costs less, on average, than children’s entertainment software. (Source: 
Educational Resources, December catalog) 


Exercise: 


Problem: 


Parents of teenage boys often complain that auto insurance costs more, on 
average, for teenage boys than for teenage girls. A group of concerned parents 
examines a random sample of insurance bills. The mean annual cost for 36 
teenage boys was $679. For 23 teenage girls, it was $559. From past years, it is 
known that the population standard deviation for each group is $180. 
Determine whether or not you believe that the mean cost for auto insurance for 
teenage boys is greater than that for teenage girls. 


Solution: 


• e. z = 2.50
• f. 0.0063
• h. Decision: Reject null


Exercise: 


Problem: 


A group of transfer bound students wondered if they will spend the same mean 
amount on texts and supplies each year at their four-year university as they 
have at their community college. They conducted a random survey of 54 
students at their community college and 66 students at their local four-year 
university. The sample means were $947 and $1011, respectively. The 
population standard deviations are known to be $254 and $87, respectively. 
Conduct a hypothesis test to determine if the means are statistically the same. 


Exercise: 


Problem: 


Joan Nguyen recently claimed that the proportion of college-age males with at
least one pierced ear is as high as the proportion of college-age females. She
conducted a survey in her classes. Out of 107 males, 20 had at least one 
pierced ear. Out of 92 females, 47 had at least one pierced ear. Do you believe 
that the proportion of males has reached the proportion of females? 


Solution: 


• e. -4.82
• f. 0
• h. Decision: Reject null


Exercise: 


Problem: 


Some manufacturers claim that non-hybrid sedan cars have a lower mean 
miles per gallon (mpg) than hybrid ones. Suppose that consumers test 21 
hybrid sedans and get a mean of 31 mpg with a standard deviation of 7 mpg. 
Thirty-one non-hybrid sedans get a mean of 22 mpg with a standard deviation 
of 4 mpg. Suppose that the population standard deviations are known to be 6 
and 3, respectively. Conduct a hypothesis test to evaluate the manufacturers' claim.


Questions [link] - [link] refer to Terri Vogel's data set (see Table of Contents).
Exercise: 


Problem: 


Using the data from Lap 1 only, conduct a hypothesis test to determine if the 
mean time for completing a lap in races is the same as it is in practices. 


Solution: 
• d. $t_{20.32}$
• e. -4.70
• f. 0.0001
• h. Decision: Reject null


Exercise: 


Problem: Repeat the test in [link], but use Lap 5 data this time. 
Exercise: 


Problem: 
Repeat the test in [link], but this time combine the data from Laps 1 and 5. 
Solution: 


• d. $t_{40.94}$

• e. -5.08

• f. 0

• h. Decision: Reject null


Exercise: 
Problem: 
In 2 — 3 complete sentences, explain in detail how you might use Terri Vogel’s 


data to answer the following question. “Does Terri Vogel drive faster in races 
than she does in practices?” 


Exercise: 


Problem: 


Is the proportion of race laps Terri completes slower than 130 seconds less 
than the proportion of practice laps she completes slower than 135 seconds? 


Solution: 
• e. -0.9223
• f. 0.1782
• h. Decision: Do not reject null


Exercise: 


Problem: "To Breakfast or Not to Breakfast?" by Richard Ayore


In the American society, birthdays are one of those days that everyone looks 
forward to. People of different ages and peer groups gather to mark the 18th, 


20th, ... birthdays. During this time, one looks back to see what he or she had 
achieved for the past year, and also focuses ahead for more to come. 


If, by any chance, I am invited to one of these parties, my experience is always 
different. Instead of dancing around with my friends while the music is 
booming, I get carried away by memories of my family back home in Kenya. I 
remember the good times I had with my brothers and sister while we did our 
daily routine. 


Every morning, I remember we went to the shamba (garden) to weed our 
crops. I remember one day arguing with my brother as to why he always 
remained behind just to join us an hour later. In his defense, he said that he 
preferred waiting for breakfast before he came to weed. He said, “This is why I 
always work more hours than you guys!” 


And so, to prove him wrong or right, we decided to give it a try. One day we
went to work as usual without breakfast, and recorded the time we could work 
before getting tired and stopping. On the next day, we all ate breakfast before 
going to work. We recorded how long we worked again before getting tired 
and stopping. Of interest was our mean increase in work time. Though not 
sure, my brother insisted that it is more than two hours. Using the data below, 
solve our problem. 


Work hours with breakfast Work hours without breakfast 
8 6 
i 5 
9 5 
5 4 
9 7 


10 7 


7 fs) 
6 6 
9 fs) 


Try these multiple choice questions. 
For questions [link] — [link], use the following information. 


A new AIDS prevention drug was tried on a group of 224 HIV positive patients.
Forty-five (45) patients developed AIDS after four years. In a control group of 224 
HIV positive patients, 68 developed AIDS after four years. We want to test whether 
the method of treatment reduces the proportion of patients that develop AIDS after 
four years or if the proportions of the treated group and the untreated group stay the 
same. 


Let the subscript t = treated patient and ut = untreated patient.
Exercise: 


Problem: The appropriate hypotheses are: 


• A. $H_0: p_t < p_{ut}$ and $H_a: p_t \geq p_{ut}$
• B. $H_0: p_t \leq p_{ut}$ and $H_a: p_t > p_{ut}$
• C. $H_0: p_t = p_{ut}$ and $H_a: p_t \neq p_{ut}$
• D. $H_0: p_t = p_{ut}$ and $H_a: p_t < p_{ut}$


Solution: 
D
Exercise: 
Problem: If the p-value is 0.0062, what is the conclusion (use α = 0.05)?


• A. The method has no effect.

• B. There is sufficient evidence to conclude that the method reduces the
proportion of HIV positive patients that develop AIDS after four years.

• C. There is sufficient evidence to conclude that the method increases the
proportion of HIV positive patients that develop AIDS after four years.

• D. There is insufficient evidence to conclude that the method reduces the
proportion of HIV positive patients that develop AIDS after four years.


Solution: 


B 
Exercise: 


Problem: 


Lesley E. Tan investigated the relationship between left-handedness and right- 
handedness and motor competence in preschool children. Random samples of 
41 left-handers and 41 right-handers were given several tests of motor skills to
determine if there is evidence of a difference between the children based on 
this experiment. The experiment produced the means and standard deviations 
shown below. Determine the appropriate test and best distribution to use for 
that test. 


Left-handed Right-handed 
Sample size 41 41
Sample mean 97.5 98.1 
Sample standard deviation LS 19.2 


• A. Two independent means, normal distribution

• B. Two independent means, student's-t distribution

• C. Matched or paired samples, student's-t distribution

• D. Two population proportions, normal distribution


Solution: 


B 


For questions [link] — [link], use the following information. 


An experiment is conducted to show that blood pressure can be consciously reduced 
in people trained in a “biofeedback exercise program.” Six (6) subjects were 
randomly selected and the blood pressure measurements were recorded before and 
after the training. The difference between blood pressures was calculated 

(after - before) producing the following results: $\bar{x}_d = -10.2$, $s_d = 8.4$. Using
the data, test the hypothesis that the blood pressure has decreased after the training.
Exercise: 


Problem: The distribution for the test is 


• A. $t_5$
• B. $t_6$
• C. $N(-10.2, 8.4)$
• D. $N\left(-10.2, \frac{8.4}{\sqrt{6}}\right)$


Solution: 


A 


Exercise: 


Problem: If α = 0.05, the p-value and the conclusion are


• A. 0.0014; There is sufficient evidence to conclude that the blood pressure
decreased after the training

• B. 0.0014; There is sufficient evidence to conclude that the blood pressure
increased after the training

• C. 0.0155; There is sufficient evidence to conclude that the blood pressure
decreased after the training

• D. 0.0155; There is sufficient evidence to conclude that the blood pressure
increased after the training


Solution: 


C


For questions [link] - [link], use the following information.


The Eastern and Western Major League Soccer conferences have a new Reserve 
Division that allows new players to develop their skills. Data for a randomly picked 
date showed the following annual goals. 


Western           Goals    Eastern         Goals
Los Angeles       9        D.C. United     9
FC Dallas         3        Chicago         8
Chivas USA        4        Columbus        7
Real Salt Lake    3        New England     6
Colorado          4        MetroStars      5
San Jose          4        Kansas City     3


Conduct a hypothesis test to determine if the Western Reserve Division teams score, 
on average, fewer goals than the Eastern Reserve Division teams. Subscripts: 1 =
Western Reserve Division (W); 2 = Eastern Reserve Division (E)

Exercise: 


Problem: The exact distribution for the hypothesis test is: 


• A. The normal distribution.

• B. The student's-t distribution.

• C. The uniform distribution.

• D. The exponential distribution.


Solution: 


B 


Exercise: 


Problem: If the level of significance is 0.05, the conclusion is: 


• A. There is sufficient evidence to conclude that the W Division teams
score, on average, fewer goals than the E teams.

• B. There is insufficient evidence to conclude that the W Division teams
score, on average, more goals than the E teams.

• C. There is insufficient evidence to conclude that the W teams score, on
average, fewer goals than the E teams score.

• D. Unable to determine.


Solution: 


C 


Questions [link] — [link] refer to the following. 


Neuroinvasive West Nile virus refers to a severe disease that affects a person’s 
nervous system. It is spread by the Culex species of mosquito. In the United States
in 2010 there were 629 reported cases of neuroinvasive West Nile virus out of a 
total of 1021 reported cases and there were 486 neuroinvasive reported cases out of 
a total of 712 cases reported in 2011. Is the 2011 proportion of neuroinvasive West 
Nile virus cases more than the 2010 proportion of neuroinvasive West Nile virus 
cases? Using a 1% level of significance, conduct an appropriate hypothesis test. 
(Source: http://www.cdc.gov/ncidod/dvbid/westnile/index.htm)


• "2011" subscript: 2011 group
• "2010" subscript: 2010 group


Exercise: 


Problem: This is: 


• A. a test of two proportions
• B. a test of two independent means
• C. a test of a single mean
• D. a test of matched pairs.


Solution: 


A 


Exercise: 


Problem: An appropriate null hypothesis is: 


• A. $p_{2011} \leq p_{2010}$
• B. $p_{2011} = p_{2010}$
• C. $p_{2011} < p_{2010}$
• D. $p_{2011} > p_{2010}$


Solution: 


A 
Exercise: 


Problem: 


The p-value is 0.0022. At a 1% level of significance, the appropriate 
conclusion is 


• A. There is sufficient evidence to conclude that the proportion of people in
the United States in 2011 that got neuroinvasive West Nile disease is less
than the proportion of people in the United States in 2010 that got
neuroinvasive West Nile disease.

• B. There is insufficient evidence to conclude that the proportion of people
in the United States in 2011 that got neuroinvasive West Nile disease is
more than the proportion of people in the United States in 2010 that got
neuroinvasive West Nile disease.

• C. There is insufficient evidence to conclude that the proportion of people
in the United States in 2011 that got neuroinvasive West Nile disease is
less than the proportion of people in the United States in 2010 that got
neuroinvasive West Nile disease.

• D. There is sufficient evidence to conclude that the proportion of people in
the United States in 2011 that got neuroinvasive West Nile disease is
more than the proportion of people in the United States in 2010 that got
neuroinvasive West Nile disease.


Solution: 


D 


Questions [link] and [link] refer to the following: 


A golf instructor is interested in determining if her new technique for improving 
players’ golf scores is effective. She takes four (4) new students. She records their 
18-holes scores before learning the technique and then after having taken her class. 
She conducts a hypothesis test. The data are as follows. 


                           Player 1    Player 2    Player 3    Player 4
Mean score before class    83          78          93          87
Mean score after class     80          80          86          86
Exercise: 


Problem: This is: 


e Aa test of two independent means 
e Ba test of two proportions 

¢ Ca test of a single proportion 

¢ Da test of matched pairs. 


Solution: 


D


Exercise: 


Problem: The correct decision is: 


• A. Reject $H_0$
• B. Do not reject $H_0$


Solution: 


B 


Questions [link] and [link] refer to the following: 


Suppose a Statistics instructor believes that there is no significant difference 
between the mean class scores of statistics day students on Exam 2 and Statistics 
night students on Exam 2. She takes random samples from each of the populations. 
The mean and standard deviation for 35 statistics day students were 75.86 and 
16.91. The mean and standard deviation for 37 statistics night students were 75.41 
and 19.73. The “day” subscript refers to the statistics day students. The “night” 
subscript refers to the statistics night students. 

Exercise: 


Problem: An appropriate alternate hypothesis for the hypothesis test is: 


• A. $\mu_{day} > \mu_{night}$
• B. $\mu_{day} < \mu_{night}$
• C. $\mu_{day} = \mu_{night}$
• D. $\mu_{day} \neq \mu_{night}$


Solution: 
D 
Exercise: 


Problem: A concluding statement is: 


• A. There is sufficient evidence to conclude that the statistics night students'
mean on Exam 2 is better than the statistics day students' mean on Exam 2.

• B. There is insufficient evidence to conclude that the statistics day students'
mean on Exam 2 is better than the statistics night students' mean on Exam 2.

• C. There is insufficient evidence to conclude that there is a significant
difference between the means of the statistics day students and night
students on Exam 2.

• D. There is sufficient evidence to conclude that there is a significant
difference between the means of the statistics day students and night
students on Exam 2.


Solution: 


C 


Review 
The next three questions refer to the following information: 


In a survey at Kirkwood Ski Resort the following information was 
recorded: 


             0 - 10    11 - 20    21 - 40    40+
Ski          10        12         30         8
Snowboard    6         17         12         5


Sport Participation by Age


Suppose that one person from the above was randomly selected.
Exercise: 


Problem: 
Find the probability that the person was a skier or was age 11 — 20. 


Solution: 


$\frac{77}{100}$


Exercise: 


Problem: 


Find the probability that the person was a snowboarder given he/she 
was age 21 — 40. 


Solution: 


$\frac{12}{42}$


Exercise: 


Problem: Explain which of the following are true and which are false. 


• a. Sport and Age are independent events.
• b. Ski and age 11 - 20 are mutually exclusive events.
• c. P(Ski and age 21 - 40) < P(Ski | age 21 - 40)
• d. P(Snowboard or age 0 - 10) < P(Snowboard | age 0 - 10)


Solution: 


• a. False
• b. False
• c. True
• d. False
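
The probabilities in the three Kirkwood questions above all come from counting cells in the Sport Participation by Age table. The sketch below is one way to organize that counting; the dictionary layout and helper names are illustrative choices, not part of the text. Exact fractions are used so the results can be compared directly with the solutions.

```python
from fractions import Fraction

# Counts from the Sport Participation by Age table
counts = {
    "Ski":       {"0-10": 10, "11-20": 12, "21-40": 30, "40+": 8},
    "Snowboard": {"0-10": 6,  "11-20": 17, "21-40": 12, "40+": 5},
}
total = sum(sum(row.values()) for row in counts.values())


def row_total(sport):
    return sum(counts[sport].values())


def col_total(age):
    return sum(row[age] for row in counts.values())


# P(Ski or 11-20): add the two totals, subtract the overlap counted twice
p_ski_or_11_20 = Fraction(
    row_total("Ski") + col_total("11-20") - counts["Ski"]["11-20"], total
)

# P(Snowboard | 21-40): restrict attention to the 21-40 column
p_snow_given_21_40 = Fraction(counts["Snowboard"]["21-40"], col_total("21-40"))

# Statement c: P(Ski and 21-40) < P(Ski | 21-40)
p_ski_and_21_40 = Fraction(counts["Ski"]["21-40"], total)
p_ski_given_21_40 = Fraction(counts["Ski"]["21-40"], col_total("21-40"))
statement_c = p_ski_and_21_40 < p_ski_given_21_40

print(p_ski_or_11_20, p_snow_given_21_40, statement_c)
```

`Fraction` reduces 12/42 to 2/7, which is the same value as the unreduced answer.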


Exercise: 
Problem: 
The average length of time a person with a broken leg wears a cast is 
approximately 6 weeks. The standard deviation is about 3 weeks. 
Thirty people who had recently healed from broken legs were 


interviewed. State the distribution that most accurately reflects total 
time to heal for the thirty people. 


Solution: 


N(180,16.43) 
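
The parameters in this answer follow from the Central Limit Theorem for sums: the total of n independent values has mean n times the population mean and standard deviation equal to the square root of n times the population standard deviation. A short sketch of that arithmetic, with variable names chosen only for illustration:

```python
import math

mu = 6      # mean time in a cast, in weeks
sigma = 3   # standard deviation, in weeks
n = 30      # number of people interviewed

sum_mean = n * mu               # mean of the total time
sum_sd = math.sqrt(n) * sigma   # standard deviation of the total time
print(sum_mean, round(sum_sd, 2))
```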


Exercise: 


Problem: 


The distribution for X is Uniform. What can we say for certain about
the distribution for $\bar{X}$ when n = 1?


• A. The distribution for $\bar{X}$ is still Uniform with the same mean and
standard deviation as the distribution for X.

• B. The distribution for $\bar{X}$ is Normal with a different mean and a
different standard deviation than the distribution for X.

• C. The distribution for $\bar{X}$ is Normal with the same mean but a
larger standard deviation than the distribution for X.

• D. The distribution for $\bar{X}$ is Normal with the same mean but a
smaller standard deviation than the distribution for X.


Solution: 


A 
Exercise: 


Problem: 


The distribution for X is uniform. What can we say for certain about
the distribution for $\sum X$ when n = 50?


• A. The distribution for $\sum X$ is still uniform with the same mean
and standard deviation as the distribution for X.

• B. The distribution for $\sum X$ is Normal with the same mean but a
larger standard deviation than the distribution for X.

• C. The distribution for $\sum X$ is Normal with a larger mean and a
larger standard deviation than the distribution for X.

• D. The distribution for $\sum X$ is Normal with the same mean but a
smaller standard deviation than the distribution for X.


Solution: 


C 


The next three questions refer to the following information: 


A group of students measured the lengths of all the carrots in a five-pound 
bag of baby carrots. They calculated the average length of baby carrots to 
be 2.0 inches with a standard deviation of 0.25 inches. Suppose we 
randomly survey 16 five-pound bags of baby carrots. 

Exercise: 


Problem: 


State the approximate distribution for $\bar{X}$, the distribution for the
average lengths of baby carrots in 16 five-pound bags. $\bar{X} \sim$


Solution: 


$N\left(2, \frac{0.25}{\sqrt{16}}\right)$


Exercise: 


Problem: 


Explain why we cannot find the probability that one individual 
randomly chosen carrot is greater than 2.25 inches. 


Exercise: 
Problem: Find the probability that $\bar{x}$ is between 2 and 2.25 inches.


Solution: 


0.5000 


The next three questions refer to the following information: 


At the beginning of the term, the amount of time a student waits in line at 
the campus store is normally distributed with a mean of 5 minutes and a 
standard deviation of 2 minutes. 


Exercise: 


Problem: Find the 90th percentile of waiting time in minutes. 


Solution: 
7.6 


Exercise: 


Problem: Find the median waiting time for one student. 


Solution: 


5 
Exercise: 


Problem: 


Find the probability that the average waiting time for 40 students is at 
least 4.5 minutes. 


Solution: 


0.9431 
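
The three waiting-time answers above can be checked with the normal distribution in Python's standard library. This is a minimal sketch, assuming the stated model of a mean of 5 minutes and a standard deviation of 2 minutes; the variable names are illustrative. The last calculation uses the distribution of the sample mean, whose standard deviation is the population value divided by the square root of the sample size.

```python
import math
from statistics import NormalDist

wait = NormalDist(mu=5, sigma=2)   # waiting time for one student

p90 = wait.inv_cdf(0.90)           # 90th percentile
median = wait.inv_cdf(0.50)        # median of a normal distribution equals its mean

# Average waiting time for 40 students
n = 40
wait_mean = NormalDist(mu=5, sigma=2 / math.sqrt(n))
p_at_least_4_5 = 1 - wait_mean.cdf(4.5)

print(round(p90, 1), median, round(p_at_least_4_5, 4))
```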


Hypothesis Testing of Two Means and Two Proportions: Lab I (edited: 
Teegarden) 

Labs changed to incorporate mini-tabs. 

Hypothesis Testing Lab - Two Samples


Name: 


Student Learning Outcomes: 
* The student will select the appropriate distributions to use in each case. 


* The student will conduct hypothesis tests and interpret the results. 


House Prices 


Conduct a hypothesis test to determine if the proportion of homes in San 
Diego that cost more than $350,000 is different from the proportion of 
homes in Los Angeles that cost more than $350,000. Using either the 
newspaper’s housing section or a website, randomly choose 40 house prices 
from each city. Be sure to include a variety of areas. Choosing only homes 
in La Jolla will greatly bias your data. (α = 0.03)


1. $H_0$ = ____ $H_a$ = ____

2. Summarize your sample statistics:
San Diego: x = ____ n = ____ Los Angeles: x = ____ n = ____
3. The distribution to use for the test is: 


4. Using Minitab, draw a graph and label it appropriately. Shade the actual 
level of significance. Include the graph with this lab. 


5. What is the formula for calculating the test statistic? 


6. Calculate the test statistic and the p-value using Minitab. Record your
values here and attach the session window to this lab.


Test statistic = ____ p-value = ____


7. Do you reject or not reject the null hypothesis? Why? (Use a complete
sentence.)


8. Write a clear conclusion using a complete sentence. 


Textbook Prices 


The data in Textbook.mtw show both the new and used textbook prices for
a sample of books required for summer 2008 classes at Mesa College. Is it
worthwhile to buy used textbooks? (α = 0.06)


1. $H_0$ = ____ $H_a$ = ____

2. Summarize your sample statistics:
New: x = ____ n = ____ Used: x = ____ n = ____

3. The distribution to use for the test is: 


4. Using Minitab, draw a graph and label it appropriately. Shade the actual 
level of significance. Include the graph with this lab. 


5. What is the formula for calculating the test statistic? 


6. Calculate the test statistic and the p-value using Minitab, and record your
values here:


Test statistic = ____ p-value = ____


7. Do you reject or not reject the null hypothesis? Why? (Use a complete
sentence.)


8. Write a clear conclusion using a complete sentence. 


The Chi-Square Distribution 

This module provides an introduction to Chi-Square Distribution as a part 
of Collaborative Statistics collection (col10522) by Barbara Illowsky and 
Susan Dean. 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


* Interpret the chi-square probability distribution as the sample size
  changes.
* Conduct and interpret chi-square goodness-of-fit hypothesis tests.
* Conduct and interpret chi-square test of independence hypothesis tests.
* Conduct and interpret chi-square homogeneity hypothesis tests.
* Conduct and interpret chi-square single variance hypothesis tests.


Introduction 


Have you ever wondered if lottery numbers were evenly distributed or if 
some numbers occurred with a greater frequency? How about if the types of 
movies people preferred were different across different age groups? What 
about if a coffee machine was dispensing approximately the same amount 
of coffee each time? You could answer these questions by conducting a 
hypothesis test. 


You will now study a new distribution, one that is used to determine the 
answers to the above examples. This distribution is called the Chi-square 
distribution. 


In this chapter, you will learn the three major applications of the Chi-square 
distribution: 


* The goodness-of-fit test, which determines if data fit a particular
  distribution, such as with the lottery example
* The test of independence, which determines if events are independent,
  such as with the movie example
* The test of a single variance, which tests variability, such as with the
  coffee example


Note: Though the Chi-square calculations depend on calculators or 
computers for most of the calculations, there is a table available (see the 
Table of Contents 15. Tables). TI-83+ and TI-84 calculator instructions are 
included in the text. 


Optional Collaborative Classroom Activity 


Look in the sports section of a newspaper or on the Internet for some sports 
data (baseball averages, basketball scores, golf tournament scores, football 
odds, swimming times, etc.). Plot a histogram and a boxplot using your 
data. See if you can determine a probability distribution that your data fits. 
Have a discussion with the class about your choice. 


Notation 

This module provides an overview of Chi-Square Distribution Notation as a 
part of Collaborative Statistics collection (col10522) by Barbara Illowsky 
and Susan Dean. 


The notation for the chi-square distribution is: 
$\chi \sim \chi^2_{df}$


where df = degrees of freedom, which depend on how chi-square is being used. (If
you want to practice calculating chi-square probabilities, then use
df = n - 1. The degrees of freedom for the three major uses are each
calculated differently.)


For the χ² distribution, the population mean is μ = df and the population
standard deviation is σ = √(2·df).


The random variable is shown as χ² but may be any upper case letter.


The random variable for a chi-square distribution with k degrees of freedom
is the sum of k independent, squared standard normal variables:


$\chi^2 = (Z_1)^2 + (Z_2)^2 + \cdots + (Z_k)^2$
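
The sum-of-squares definition can be checked by simulation. The sketch below is only an illustration, not part of the original text: the function name `simulate_chi_square`, the choice df = 5, the number of trials, and the seed are all assumptions made for the example. It draws sums of squared standard normal values and compares the sample mean and standard deviation with μ = df and σ = √(2·df).

```python
import random
import statistics


def simulate_chi_square(df, trials, seed=0):
    """Return `trials` values, each the sum of `df` squared standard normal draws."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        for _ in range(trials)
    ]


df = 5  # illustrative choice of degrees of freedom
draws = simulate_chi_square(df, trials=20000)

sample_mean = statistics.mean(draws)
sample_sd = statistics.pstdev(draws)

print("sample mean:", round(sample_mean, 2), "theory:", df)
print("sample sd:  ", round(sample_sd, 2), "theory:", round((2 * df) ** 0.5, 2))
```

With a large number of trials, the simulated mean and standard deviation should land close to the theoretical values, and no simulated value is negative because every term is a square.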


Facts About the Chi-Square Distribution 


1. The curve is nonsymmetrical and skewed to the right. 
2. There is a different chi-square curve for each df. 


[Figure: chi-square curves for df = 2 and df = 24]


3. The test statistic for any test is always greater than or equal to zero. 
4. When df > 90, the chi-square curve approximates the normal. For
   X ~ χ²_1000, the mean is μ = df = 1000 and the standard deviation is
   σ = √(2·1000) = 44.7. Therefore, X ~ N(1000, 44.7),
   approximately.
5. The mean, μ, is located just to the right of the peak.




In the next sections, you will learn about four different applications of the 
Chi-Square Distribution. These hypothesis tests are almost always right- 
tailed tests. In order to understand why the tests are mostly right-tailed, you 
will need to look carefully at the actual definition of the test statistic. Think 
about the following while you study the next four sections. If the expected 
and observed values are "far" apart, then the test statistic will be "large" and 
we will reject in the right tail. The only way to obtain a test statistic very 
close to zero, would be if the observed and expected values are very, very 
close to each other. A left-tailed test could be used to determine if the fit
were "too good." A "too good" fit might occur if data had been manipulated
or invented. Think about the implications of right-tailed versus left-tailed
hypothesis tests as you learn the applications of the Chi-Square
Distribution.


Goodness-of-Fit Test 
This module describes how the chi-square distribution is used to conduct a goodness-of-fit
test.


In this type of hypothesis test, you determine whether the data "fit" a particular 
distribution or not. For example, you may suspect your unknown data fit a binomial 
distribution. You use a chi-square test (meaning the distribution for the hypothesis test is 
chi-square) to determine if there is a fit or not. The null and the alternate hypotheses 
for this test may be written in sentences or may be stated as equations or 
inequalities. 


The test statistic for a goodness-of-fit test is:
Equation:


$\sum_{k} \frac{(O - E)^2}{E}$


where:


* O = observed values (data)
* E = expected values (from theory)
* k = the number of different data cells or categories


The observed values are the data values and the expected values are the values you
would expect to get if the null hypothesis were true. There are k terms of the form
$\frac{(O - E)^2}{E}$, one for each cell.

The degrees of freedom are df = (number of categories - 1).


The goodness-of-fit test is almost always right tailed. If the observed values and the 
corresponding expected values are not close to each other, then the test statistic can get 
very large and will be way out in the right tail of the chi-square curve. 


Note:The expected value for each cell needs to be at least 5 in order to use this test. 


Example: 

Absenteeism of college students from math classes is a major concern to math
instructors because missing class appears to increase the drop rate. Suppose that a study
was done to determine if the actual student absenteeism follows faculty perception. The
faculty expected that a group of 100 students would miss class according to the
following chart.


Number of absences per term    Expected number of students
0-2                            50
3-5                            30
6-8                            12
9-11                           6
12+                            2


A random survey across all mathematics courses was then done to determine the actual 
number (observed) of absences in a course. The next chart displays the result of that 
survey. 


Number of absences per term    Actual number of students
0-2                            35
3-5                            40
6-8                            20
9-11                           1
12+                            4


Determine the null and alternate hypotheses needed to conduct a goodness-of-fit test.
H_0: Student absenteeism fits faculty perception.


The alternate hypothesis is the opposite of the null hypothesis.
H_a: Student absenteeism does not fit faculty perception.


Exercise: 


Problem: 


Can you use the information as it appears in the charts to conduct the goodness-of-fit
test?


Solution: 


No. Notice that the expected number of students for the "12+" entry is less than 5
(it is 2). Combine that group with the "9 - 11" group to create new tables where the
number of students for each entry is at least 5. The new tables are below.


Number of absences per term    Expected number of students
0-2                            50
3-5                            30
6-8                            12
9+                             8


Number of absences per term    Actual number of students
0-2                            35
3-5                            40
6-8                            20
9+                             5


Exercise: 


Problem: What are the degrees of freedom (df)? 
Solution: 


There are 4 "cells" or categories in each of the new tables. 


df = number of cells - 1 = 4 - 1 = 3
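
The cell-combining step and the resulting degrees of freedom can be sketched in Python. This is a minimal illustration under stated assumptions: `merge_small_tail` is a hypothetical helper written for this example, and it only handles the case shown here, where the cell that is too small sits at the end of the table and is folded into its neighbor.

```python
def merge_small_tail(labels, expected, observed, minimum=5):
    """Fold the last cell into its neighbor until the last expected count is >= minimum.

    Hypothetical helper for illustration: it only repairs a small cell at the
    end of the table, which is the situation in the absenteeism example.
    """
    labels, expected, observed = list(labels), list(expected), list(observed)
    while len(expected) > 1 and expected[-1] < minimum:
        lower_bound = labels[-2].split("-")[0]  # e.g. "9" from "9-11"
        labels[-2:] = [lower_bound + "+"]
        expected[-2:] = [expected[-2] + expected[-1]]
        observed[-2:] = [observed[-2] + observed[-1]]
    return labels, expected, observed


labels = ["0-2", "3-5", "6-8", "9-11", "12+"]
expected = [50, 30, 12, 6, 2]   # faculty perception
observed = [35, 40, 20, 1, 4]   # survey results

new_labels, new_expected, new_observed = merge_small_tail(labels, expected, observed)
df = len(new_expected) - 1      # number of cells - 1

print(new_labels)
print(new_expected, new_observed, "df =", df)
```

The combined "9+" cell has an expected count of 6 + 2 = 8 and an observed count of 1 + 4 = 5, leaving 4 cells and df = 3.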


Example: 

Employers particularly want to know which days of the week employees are absent in a
five day work week. Most employers would like to believe that employees are absent
equally during the week. Suppose a random sample of 60 managers was asked on
which day of the week they had the highest number of employee absences. The
results were distributed as follows:


                      Monday   Tuesday   Wednesday   Thursday   Friday
Number of Absences    15       12        9           9          15


Day of the Week Employees Were Most Absent


Exercise: 
Problem: 
For the population of employees, do the days for the highest number of absences 
occur with equal frequencies during a five day work week? Test at a 5% 
significance level. 


Solution: 


The null and alternate hypotheses are: 


* H_0: The absent days occur with equal frequencies, that is, they fit a uniform
  distribution.
* H_a: The absent days occur with unequal frequencies, that is, they do not fit a
  uniform distribution.


If the absent days occur with equal frequencies, then, out of 60 absent days (the
total in the sample: 15 + 12 + 9 + 9 + 15 = 60), there would be 12 absences on
Monday, 12 on Tuesday, 12 on Wednesday, 12 on Thursday, and 12 on Friday.
These numbers are the expected (E) values. The values in the table are the
observed (O) values or data.


This time, calculate the χ² test statistic by hand. Make a chart with the following
headings and fill in the columns:


* Expected (E) values (12, 12, 12, 12, 12)
* Observed (O) values (15, 12, 9, 9, 15)
* (O - E)
* (O - E)²
* (O - E)²/E

The last column, (O - E)²/E, should have 0.75, 0, 0.75, 0.75, 0.75.


Now add (sum) the last column. Verify that the sum is 3. This is the χ² test
statistic.


To find the p-value, calculate P(χ² > 3). This test is right-tailed.


(Use a computer or calculator to find the p-value. You should get
p-value = 0.5578.)


The dfs are the number of cells - 1 = 5 - 1 = 4.


TI-83+ and TI-84: Press 2nd DISTR. Arrow down to χ2cdf. Press ENTER.
Enter (3,10^99,4). Rounded to 4 decimal places, you should see 0.5578, which
is the p-value.


Next, complete a graph like the one below with the proper labeling and shading. 
(You should shade the right tail.) 


[Figure: chi-square curve]


The decision is to not reject the null hypothesis.
Conclusion: At a 5% level of significance, from the sample data, there is not
sufficient evidence to conclude that the absent days do not occur with equal
frequencies.
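

The hand calculation for this example can be reproduced with a short Python sketch. It is an illustration only, not the book's method (the text uses a TI calculator). The helper `chi_square_sf_even_df` is a name invented for this example; it uses a closed form for the right-tail area that is valid only when df is even, which covers df = 4 here.

```python
import math


def chi_square_sf_even_df(x, df):
    """Right-tail area P(chi-square > x), using a closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )


observed = [15, 12, 9, 9, 15]                               # Monday through Friday
expected = [sum(observed) / len(observed)] * len(observed)  # uniform: 12 per day

# Sum of (O - E)^2 / E over the five cells.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi_square_sf_even_df(statistic, df)

print("test statistic:", statistic)
print("df:", df)
print("p-value:", round(p_value, 4))
```

The result agrees with the values above: a test statistic of 3 and a p-value of about 0.5578, which is larger than 0.05, so the null hypothesis is not rejected.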


Note: TI-83+ and some TI-84 calculators do not have a special program for the test
statistic for the goodness-of-fit test. The next example (Example 11-3) has the
calculator instructions. The newer TI-84 calculators have in STAT TESTS the test
Chi2 GOF. To run the test, put the observed values (the data) into a first list and the
expected values (the values you expect if the null hypothesis is true) into a second list.
Press STAT TESTS and Chi2 GOF. Enter the list names for the Observed list and the
Expected list. Enter the degrees of freedom and press calculate or draw. Make sure
you clear any lists before you start. See below.


Note: To Clear Lists in the calculators: Go into STAT EDIT and arrow up to the list
name area of the particular list. Press CLEAR and then arrow down. The list will be
cleared. Or, you can press STAT and press 4 (for ClrList). Enter the list name and
press ENTER.


Example: 


One study indicates that the number of televisions that American families have is
distributed (this is the given distribution for the American population) as follows:


Number of Televisions    Percent
0                        10
1                        16
2                        55
3                        11
over 3                   8


The table contains expected (E) percents.
A random sample of 600 families in the far western United States resulted in the 
following data: 


Number of Televisions    Frequency
0                        66
1                        119
2                        340
3                        60
over 3                   15
Total = 600


The table contains observed (O) frequency values. 
Exercise: 


Problem: 
At the 1% significance level, does it appear that the distribution "number of
televisions" of far western United States families is different from the distribution
for the American population as a whole?


Solution: 
This problem asks you to test whether the far western United States families
distribution fits the distribution of the American families. This test is always
right-tailed.


The first table contains expected percentages. To get expected (E) frequencies,
multiply the percentage by 600. The expected frequencies are:


Number of Televisions    Percent    Expected Frequency
0                        10         (0.10) · (600) = 60
1                        16         (0.16) · (600) = 96
2                        55         (0.55) · (600) = 330
3                        11         (0.11) · (600) = 66
over 3                   8          (0.08) · (600) = 48


Therefore, the expected frequencies are 60, 96, 330, 66, and 48. In the TI
calculators, you can let the calculator do the math. For example, instead of 60, enter
.10*600.


H_0: The "number of televisions" distribution of far western United States families
is the same as the "number of televisions" distribution of the American population.


H_a: The "number of televisions" distribution of far western United States families
is different from the "number of televisions" distribution of the American
population.


Distribution for the test: χ²_4 where df = (the number of cells) - 1 = 5 - 1 = 4.


Note: df ≠ 600 - 1


Calculate the test statistic: χ² = 29.65


Graph:


p-value = 0.000006 (almost 0)


Probability statement: p-value = P(χ² > 29.65) = 0.000006.


Compare α and the p-value:


* α = 0.01
* p-value = 0.000006


So, α > p-value.
Make a decision: Since α > p-value, reject H_0.


This means you reject the belief that the distribution for the far western states is the 
same as that of the American population as a whole. 


Conclusion: At the 1% significance level, from the data, there is sufficient 
evidence to conclude that the "number of televisions" distribution for the far 
western United States is different from the "number of televisions" distribution for 
the American population as a whole. 
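

As a cross-check on the calculator steps, the same test can be sketched in Python. This is an illustration under assumptions, not the book's procedure: the observed count for "over 3" is taken as 15 (the value that makes the five frequencies sum to 600), and the p-value uses a closed form for the right-tail area that applies only when df = 4.

```python
import math

percents = [0.10, 0.16, 0.55, 0.11, 0.08]   # 0, 1, 2, 3, over 3 televisions
observed = [66, 119, 340, 60, 15]            # "over 3" assumed to be 15 so the total is 600

n = sum(observed)
expected = [p * n for p in percents]         # 60, 96, 330, 66, 48

# Sum of (O - E)^2 / E over the five cells.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                       # 5 cells - 1 = 4, not 600 - 1

# For df = 4 only: P(chi-square > x) = exp(-x/2) * (1 + x/2).
half = statistic / 2.0
p_value = math.exp(-half) * (1.0 + half)

alpha = 0.01
print("test statistic:", round(statistic, 2))
print("p-value:", p_value)
print("reject H_0:", p_value < alpha)
```

The statistic rounds to 29.65 and the p-value is about 0.000006, far below α = 0.01, which matches the decision to reject the null hypothesis.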


Note: TI-83+ and some TI-84 calculators: Press STAT and ENTER. Make sure to clear
lists L1, L2, and L3 if they have data in them (see the note at the end of Example 11-2).
Into L1, put the observed frequencies 66, 119, 340, 60, 15. Into L2, put the expected
frequencies .10*600, .16*600, .55*600, .11*600, .08*600. Arrow over to list L3 and up
to the name area "L3". Enter (L1-L2)^2/L2 and ENTER. Press 2nd QUIT. Press
2nd LIST and arrow over to MATH. Press 5. You should see "sum" (Enter L3).
Rounded to 2 decimal places, you should see 29.65. Press 2nd DISTR. Press 7 or
arrow down to 7:χ2cdf and press ENTER. Enter (29.65,1E99,4). Rounded to 4 places,
you should see 5.77E-6 = .000006 (rounded to 6 decimal places), which is the p-value.

The newer TI-84 calculators have in STAT TESTS the test Chi2 GOF. To run the test,
put the observed values (the data) into a first list and the expected values (the values you
expect if the null hypothesis is true) into a second list. Press STAT TESTS and
Chi2 GOF. Enter the list names for the Observed list and the Expected list. Enter the
degrees of freedom and press calculate or draw. Make sure you clear any lists before
you start.


Example: 
Exercise: 


Problem: 


Suppose you flip two coins 100 times. The results are 20 HH, 27 HT, 30 TH, and 
23 TT. Are the coins fair? Test at a 5% significance level. 


Solution: 


This problem can be set up as a goodness-of-fit problem. The sample space for 
flipping two fair coins is {HH, HT, TH, TT}. Out of 100 flips, you would expect 25 
HH, 25 HT, 25 TH, and 25 TT. This is the expected distribution. The question, "Are 
the coins fair?" is the same as saying, "Does the distribution of the coins (20 HH, 
27 HT, 30 TH, 23 TT) fit the expected distribution?" 


Random Variable: Let X = the number of heads in one flip of the two coins. X 
takes on the value 0, 1, 2. (There are 0, 1, or 2 heads in the flip of 2 coins.) 
Therefore, the number of cells is 3. Since X = the number of heads, the observed 
frequencies are 20 (for 2 heads), 57 (for 1 head), and 23 (for 0 heads or both tails). 
The expected frequencies are 25 (for 2 heads), 50 (for 1 head), and 25 (for 0 heads 
or both tails). This test is right-tailed. 


H_0: The coins are fair.

H_a: The coins are not fair.

Distribution for the test: χ²_2 where df = 3 - 1 = 2.
Calculate the test statistic: χ² = 2.14


Graph:


p-value = 0.3430


Probability statement: p-value = P(χ² > 2.14) = 0.3430


Compare α and the p-value:


* α = 0.05
* p-value = 0.3430


So, α < p-value.
Make a decision: Since α < p-value, do not reject H_0.


Conclusion: There is insufficient evidence to conclude that the coins are not fair. 
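

The grouping of the four outcomes into "number of heads" cells is the step most easily missed in this example, so it is sketched here in Python. The code is an illustration only; the variable names are invented for the example, and the p-value line uses a closed form for the right-tail area that holds only for df = 2.

```python
import math
from collections import Counter

flips = {"HH": 20, "HT": 27, "TH": 30, "TT": 23}   # results of 100 flips of two coins

# Group outcomes by X = number of heads in one flip of the two coins.
observed = Counter()
for outcome, count in flips.items():
    observed[outcome.count("H")] += count            # 2 heads: 20, 1 head: 57, 0 heads: 23

n = sum(flips.values())
fair_probability = {0: 0.25, 1: 0.50, 2: 0.25}       # distribution of X for fair coins
expected = {heads: p * n for heads, p in fair_probability.items()}

# Sum of (O - E)^2 / E over the three cells.
statistic = sum(
    (observed[heads] - expected[heads]) ** 2 / expected[heads]
    for heads in fair_probability
)
df = len(fair_probability) - 1

# For df = 2 only: P(chi-square > x) = exp(-x/2).
p_value = math.exp(-statistic / 2.0)

print("observed by number of heads:", dict(observed))
print("test statistic:", round(statistic, 2), "df:", df)
print("p-value:", round(p_value, 4))
```

The three cells give a test statistic of about 2.14 and a p-value of about 0.3430, consistent with not rejecting the null hypothesis at α = 0.05.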


Note: TI-83+ and some TI-84 calculators: Press STAT and ENTER. Make sure you clear
lists L1, L2, and L3 if they have data in them. Into L1, put the observed frequencies
20, 57, 23. Into L2, put the expected frequencies 25, 50, 25. Arrow over to list L3 and
up to the name area "L3". Enter (L1-L2)^2/L2 and ENTER. Press 2nd QUIT. Press
2nd LIST and arrow over to MATH. Press 5. You should see "sum". Enter L3. Rounded
to 2 decimal places, you should see 2.14. Press 2nd DISTR. Arrow down to 7:χ2cdf
(or press 7). Press ENTER. Enter (2.14,1E99,2). Rounded to 4 places, you should see
.3430, which is the p-value.

The newer TI-84 calculators have in STAT TESTS the test Chi2 GOF. To run the test,
put the observed values (the data) into a first list and the expected values (the values you
expect if the null hypothesis is true) into a second list. Press STAT TESTS and
Chi2 GOF. Enter the list names for the Observed list and the Expected list. Enter the
degrees of freedom and press calculate or draw. Make sure you clear any lists before
you start.


Test of Independence 
This module describes how the chi-square distribution can be used to test for independence. 


Tests of independence involve using a contingency table of observed (data) values. You first 
saw a contingency table when you studied probability in the Probability Topics chapter. 


The test statistic for a test of independence is similar to that of a goodness-of-fit test:
Equation:


$\sum_{(i \cdot j)} \frac{(O - E)^2}{E}$


where:


* O = observed values
* E = expected values
* i = the number of rows in the table
* j = the number of columns in the table


There are i · j terms of the form $\frac{(O - E)^2}{E}$.


A test of independence determines whether two factors are independent or not. You first 
encountered the term independence in Chapter 3. As a review, consider the following 
example. 


Note:The expected value for each cell needs to be at least 5 in order to use this test. 


Example: 

Suppose A = a speeding violation in the last year and B = a cell phone user while driving. If
A and B are independent then P(A AND B) = P(A)P(B). A AND B is the event that a
driver received a speeding violation last year and is also a cell phone user while driving.
Suppose, in a study of drivers who received speeding violations in the last year and who use
cell phones while driving, that 755 people were surveyed. Out of the 755, 70 had a speeding
violation and 685 did not; 305 were cell phone users while driving and 450 were not.

Let y = expected number of drivers that use a cell phone while driving and received speeding
violations.


If A and B are independent, then P(A AND B) = P(A)P(B). By substitution,
$\frac{y}{755} = \frac{70}{755} \cdot \frac{305}{755}$


Solve for y: $y = \frac{70 \cdot 305}{755} = 28.3$


About 28 people from the sample are expected to be cell phone users while driving and to 
receive speeding violations. 
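

The substitution above can be checked with a few lines of Python. The sketch is illustrative only; `expected_count` is a hypothetical helper name. It shows that multiplying the sample size by P(A)P(B) gives the same number as (row total)(column total) divided by the total surveyed.

```python
def expected_count(row_total, column_total, grand_total):
    """Expected cell count under independence: (row total)(column total) / total."""
    return row_total * column_total / grand_total


surveyed = 755
speeding = 70        # drivers with a speeding violation in the last year
cell_users = 305     # drivers who use a cell phone while driving

# Route 1: n * P(A) * P(B), straight from P(A AND B) = P(A)P(B).
y_from_probabilities = surveyed * (speeding / surveyed) * (cell_users / surveyed)

# Route 2: the marginal-totals formula.
y_from_totals = expected_count(speeding, cell_users, surveyed)

print(round(y_from_probabilities, 1), round(y_from_totals, 1))
```

Both routes give about 28.3, the expected number of drivers in the sample who both use a cell phone while driving and received a speeding violation.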


In a test of independence, we state the null and alternate hypotheses in words. Since the 
contingency table consists of two factors, the null hypothesis states that the factors are 
independent and the alternate hypothesis states that they are not independent (dependent). 
If we do a test of independence using the example above, then the null hypothesis is: 

H_0: Being a cell phone user while driving and receiving a speeding violation are independent
events.

If the null hypothesis were true, we would expect about 28 people to be cell phone users 
while driving and to receive a speeding violation. 

The test of independence is always right-tailed because of the calculation of the test
statistic. If the expected and observed values are not close together, then the test statistic is
very large and way out in the right tail of the chi-square curve, as in a goodness-of-fit test.

The degrees of freedom for the test of independence are:

df = (number of columns - 1)(number of rows - 1)


The following formula calculates the expected number (E):
$E = \frac{(\text{row total})(\text{column total})}{\text{total number surveyed}}$


Example: 

In a volunteer group, adults 21 and older volunteer from one to nine hours each week to 
spend time with a disabled senior citizen. The program recruits among community college 
students, four-year college students, and nonstudents. The following table is a sample of the 
adult volunteers and the number of hours they volunteer per week. 


Type of Volunteer              1-3 Hours   4-6 Hours   7-9 Hours   Row Total
Community College Students     111         96          48          255
Four-Year College Students     96          133         61          290
Nonstudents                    91          150         53          294
Column Total                   298         379         162         839


Number of Hours Worked Per Week by Volunteer Type (Observed)
The table contains observed (O) values (data).


Exercise: 


Problem: Are the number of hours volunteered independent of the type of volunteer? 
Solution: 


The observed table and the question at the end of the problem, "Are the number of 
hours volunteered independent of the type of volunteer?" tell you this is a test of 
independence. The two factors are number of hours volunteered and type of 
volunteer. This test is always right-tailed. 


H_0: The number of hours volunteered is independent of the type of volunteer.
H_a: The number of hours volunteered is dependent on the type of volunteer.


The expected table is: 


Type of Volunteer              1-3 Hours   4-6 Hours   7-9 Hours
Community College Students     90.57       115.19      49.24
Four-Year College Students     103.00      131.00      56.00
Nonstudents                    104.42      132.81      56.77


Number of Hours Worked Per Week by Volunteer Type (Expected)
The table contains expected (E) values.


For example, the calculation for the expected frequency for the top left cell is


$E = \frac{(\text{row total})(\text{column total})}{\text{total number surveyed}} = \frac{255 \cdot 298}{839} = 90.57$


Calculate the test statistic: χ² = 12.99 (calculator or computer)
Distribution for the test: χ²_4
df = (3 columns - 1)(3 rows - 1) = (2)(2) = 4


Graph: [Figure: chi-square curve, horizontal axis marked at 0 and 12.99]


p-value = 0.0113


Probability statement: p-value = P(χ² > 12.99) = 0.0113


Compare α and the p-value: Since no α is given, assume α = 0.05.
p-value = 0.0113. α > p-value.


Make a decision: Since α > p-value, reject H_0. This means that the factors are not
independent.


Conclusion: At a 5% level of significance, from the data, there is sufficient evidence to 
conclude that the number of hours volunteered and the type of volunteer are dependent 
on one another. 
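

The whole test of independence for the volunteer data can be sketched in Python. This is an illustration, not the book's calculator procedure. It assumes the first observed row is 111, 96, 48 (row total 255), which is what the column totals 298, 379, 162 require, and it uses a closed form for the right-tail area that is valid only for df = 4.

```python
import math

# Observed counts: rows are volunteer types, columns are 1-3, 4-6, 7-9 hours.
# The first row is assumed to be 111, 96, 48, as implied by the column totals.
observed = [
    [111, 96, 48],    # community college students
    [96, 133, 61],    # four-year college students
    [91, 150, 53],    # nonstudents
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
total = sum(row_totals)

# E = (row total)(column total) / (total number surveyed) for every cell.
expected = [[r * c / total for c in column_totals] for r in row_totals]

# Sum of (O - E)^2 / E over all nine cells.
statistic = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)
df = (len(observed[0]) - 1) * (len(observed) - 1)   # (columns - 1)(rows - 1)

# For df = 4 only: P(chi-square > x) = exp(-x/2) * (1 + x/2).
half = statistic / 2.0
p_value = math.exp(-half) * (1.0 + half)

print("expected top-left cell:", round(expected[0][0], 2))
print("test statistic:", round(statistic, 2), "df:", df)
print("p-value:", round(p_value, 4))
```

The sketch gives 90.57 for the top-left expected cell, a test statistic of about 12.99, and a p-value of about 0.0113, in line with the values in the solution.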


For the above example, if there had been another type of volunteer, teenagers, what 
would the degrees of freedom be? 


Note: Calculator instructions follow.


TI-83+ and TI-84 calculator: Press the MATRX key and arrow over to EDIT. Press 1:
[A]. Press 3 ENTER 3 ENTER. Enter the table values by row from Example 11-6.
Press ENTER after each. Press 2nd QUIT. Press STAT and arrow over to TESTS.
Arrow down to C:χ2-TEST. Press ENTER. You should see Observed:[A] and
Expected:[B]. Arrow down to Calculate. Press ENTER. The test statistic is
12.9909 and the p-value = 0.0113. Do the procedure a second time but arrow down to
Draw instead of Calculate.


Example: 

De Anza College is interested in the relationship between anxiety level and the need to 
succeed in school. A random sample of 400 students took a test that measured anxiety level 
and need to succeed in school. The table shows the results. De Anza College wants to know 
if anxiety level and need to succeed in school are independent events. 


Need to Succeed   High      Med-high   Medium    Med-low   Low       Row
in School         Anxiety   Anxiety    Anxiety   Anxiety   Anxiety   Total
High Need         35        42         53        15        10        155
Medium Need       18        48         63        33        31        193
Low Need          4         5          11        15        17        52
Column Total      57        95         127       63        58        400


Need to Succeed in School vs. Anxiety Level


Exercise: 


Problem: 


How many high anxiety level students are expected to have a high need to succeed in 
school? 


Solution: 


The column total for a high anxiety level is 57. The row total for high need to succeed 
in school is 155. The sample size or total surveyed is 400. 


$E = \frac{(\text{row total})(\text{column total})}{\text{total surveyed}} = \frac{155 \cdot 57}{400} = 22.09$


The expected number of students who have a high anxiety level and a high need to 
succeed in school is about 22. 

Exercise: 
Problem: 


If the two variables are independent, how many students do you expect to have a low 
need to succeed in school and a med-low level of anxiety? 


Solution: 


The column total for a med-low anxiety level is 63. The row total for a low need to 
succeed in school is 52. The sample size or total surveyed is 400. 


Exercise:


Problem:


* a. E = (row total)(column total) / (total surveyed) = ________
* b. The expected number of students who have a med-low anxiety level and a
  low need to succeed in school is about: ________


Solution:


* a. $E = \frac{(\text{row total})(\text{column total})}{\text{total surveyed}} = 8.19$
* b. 8


Glossary 


Contingency Table 
The method of displaying a frequency distribution as a table with rows and columns to 
show how two variables may be dependent (contingent) upon each other. The table 
provides an easy way to calculate conditional probabilities. 


Test of a Single Variance (Optional) 

This module provides an overview on Chi-Square Distribution Tests of 
Variance as a part of Collaborative Statistics collection (col10522) by 
Barbara Illowsky and Susan Dean. 


A test of a single variance assumes that the underlying distribution is 
normal. The null and alternate hypotheses are stated in terms of the 
population variance (or population standard deviation). The test statistic 
is: 

Equation:


$\frac{(n - 1) \cdot s^2}{\sigma^2}$


where:


* n = the total number of data
* s² = sample variance
* σ² = population variance


You may think of s as the random variable in this test. The degrees of 
freedom are df = n — 1. 


A test of a single variance may be right-tailed, left-tailed, or two-tailed. 


The following example will show you how to set up the null and alternate 
hypotheses. The null and alternate hypotheses contain statements about the 
population variance. 


Example: 
Exercise: 


Problem: 


Math instructors are not only interested in how their students do on 
exams, on average, but how the exam scores vary. To many 
instructors, the variance (or standard deviation) may be more 
important than the average. 


Suppose a math instructor believes that the standard deviation for his 
final exam is 5 points. One of his best students thinks otherwise. The 
student claims that the standard deviation is more than 5 points. If the 
student were to conduct a hypothesis test, what would the null and 
alternate hypotheses be? 


Solution: 


Even though we are given the population standard deviation, we can 
set the test up using the population variance as follows. 


* H_0: σ² = 5²
* H_a: σ² > 5²
Example: 
Exercise: 
Problem: 


With individual lines at its various windows, a post office finds that 
the standard deviation for normally distributed waiting times for 
customers on Friday afternoon is 7.2 minutes. The post office 
experiments with a single main waiting line and finds that for a 
random sample of 25 customers, the waiting times for customers have 
a standard deviation of 3.5 minutes. 


With a significance level of 5%, test the claim that a single line
causes lower variation among waiting times (shorter waiting
times) for customers.
Solution: 


Since the claim is that a single line causes lower variation, this is a
test of a single variance. The parameter is the population variance, σ²,
or the population standard deviation, σ.


Random Variable: The sample standard deviation, s, is the random 
variable. Let s = standard deviation for the waiting times. 


* H_0: σ² = 7.2²
* H_a: σ² < 7.2²


The word "lower" tells you this is a left-tailed test.
Distribution for the test: χ²_24, where:


* n = the number of customers sampled
* df = n - 1 = 25 - 1 = 24


Calculate the test statistic:


$\chi^2 = \frac{(n - 1) \cdot s^2}{\sigma^2} = \frac{(25 - 1) \cdot 3.5^2}{7.2^2} = 5.67$


where n = 25, s = 3.5, and σ = 7.2.
Graph: [Figure: chi-square curve, horizontal axis marked at 0 and 5.67]


p-value = 0.000042


Probability statement: p-value = P(χ² < 5.67) = 0.000042


Compare α and the p-value:
α = 0.05, p-value = 0.000042, α > p-value


Make a decision: Since α > p-value, reject H_0.


This means that you reject σ² = 7.2². In other words, you do not
think the variation in waiting times is 7.2 minutes, but lower.


Conclusion: At a 5% level of significance, from the data, there is 
sufficient evidence to conclude that a single line causes a lower 
variation among the waiting times or with a single line, the customer 
waiting times vary less than 7.2 minutes. 


TI-83+ and TI-84 calculators: In 2nd DISTR, use 7:χ2cdf. The
syntax is (lower, upper, df) for the parameter list. For
Example 11-9, χ2cdf(-1E99,5.67,24). The
p-value = 0.000042.
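

The post office calculation can also be sketched in Python as a check on the calculator result. This is an illustration only, not the book's method. The helper `chi_square_cdf_even_df` is a name made up for this example, and its closed form for the left-tail area is valid only for even df, which covers df = 24 here.

```python
import math


def chi_square_cdf_even_df(x, df):
    """Left-tail area P(chi-square < x), using a closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    right_tail = math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )
    return 1.0 - right_tail


n = 25            # customers sampled
s = 3.5           # sample standard deviation (minutes)
sigma_0 = 7.2     # standard deviation stated in the null hypothesis (minutes)

statistic = (n - 1) * s ** 2 / sigma_0 ** 2   # (n - 1) * s^2 / sigma^2
df = n - 1
p_value = chi_square_cdf_even_df(statistic, df)   # left-tailed test

alpha = 0.05
print("test statistic:", round(statistic, 2), "df:", df)
print("p-value:", p_value)
print("reject H_0:", p_value < alpha)
```

The statistic rounds to 5.67 and the left-tail area is about 0.000042, so the null hypothesis is rejected at the 5% level, as in the solution above.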


Summary of Formulas 

This module provides a summary on formulas used in Chi-Square 
Distribution as a part of Collaborative Statistics collection (col10522) by 
Barbara Illowsky and Susan Dean. 


The Chi-Square Probability Distribution 
μ = df and σ = √(2·df)


Goodness-of-Fit Hypothesis Test 


* Use goodness-of-fit to test whether a data set fits a particular
  probability distribution.
* The degrees of freedom are number of cells or categories - 1.
* The test statistic is $\sum_{k} \frac{(O - E)^2}{E}$, where O = observed values (data), E =
  expected values (from theory), and k = the number of different data
  cells or categories.
* The test is right-tailed.


Test of Independence 


* Use the test of independence to test whether two factors are
  independent or not.
* The degrees of freedom are equal to
  (number of columns - 1)(number of rows - 1).
* The test statistic is $\sum_{(i \cdot j)} \frac{(O - E)^2}{E}$, where O = observed values, E =
  expected values, i = the number of rows in the table, and j = the
  number of columns in the table.
* The test is right-tailed.
* If the null hypothesis is true, the expected number
  $E = \frac{(\text{row total})(\text{column total})}{\text{total surveyed}}$


Test of Homogeneity 


* Use the test for homogeneity to decide if two populations with
  unknown distributions have the same distribution as each other.
* The degrees of freedom are equal to number of columns - 1.
* The test statistic is $\sum_{(i \cdot j)} \frac{(O - E)^2}{E}$, where O = observed values, E =
  expected values, i = the number of rows in the table, and j = the
  number of columns in the table.
* The test is right-tailed.
* If the null hypothesis is true, the expected number
  $E = \frac{(\text{row total})(\text{column total})}{\text{total surveyed}}$


Note:The expected value for each cell needs to be at least 5 in order to use 
the Goodness-of-Fit, Independence and Homogeneity tests. 


Test of a Single Variance 


* Use the test to determine variation.
* The degrees of freedom are the number of samples - 1.
* The test statistic is $\frac{(n - 1) \cdot s^2}{\sigma^2}$, where n = the total number of data, s² =
  sample variance, and σ² = population variance.
* The test may be left, right, or two-tailed.


Practice 1: Goodness-of-Fit Test 

This module provides a practice on Chi-Square Distribution as a part of 
Collaborative Statistics collection (col10522) by Barbara Illowsky and 
Susan Dean. 


Student Learning Outcomes 


* The student will conduct a goodness-of-fit test.


Given 


The following data are real. The cumulative number of AIDS cases reported 
for Santa Clara County is broken down by ethnicity as follows: (Source: 
HIV/AIDS Epidemiology Santa Clara County, Santa Clara County Public 
Health Department, May 2011) 


Ethnicity Number of Cases 
White 2229 
Hispanic 1157 
Black/African-American 457 
Asian, Pacific Islander 232 

Total = 4075 


The percentage of each ethnic group in Santa Clara County is as follows:


Ethnicity                  Percentage of total    Number expected
                           county population      (round to 2 decimal places)
White                      42.9%                  1748.18
Hispanic                   26.7%
Black/African-American     2.6%
Asian, Pacific Islander    27.8%
                           Total = 100%


Expected Results


If the ethnicity of AIDS victims followed the ethnicity of the total county
population, fill in the expected number of cases per ethnic group.


Goodness-of-Fit Test 


Perform a goodness-of-fit test to determine whether the make-up of AIDS 
cases follows the ethnicity of the general population of Santa Clara County. 


Exercise: 


Problem: $H_o$:


Exercise: 


Problem: $H_a$:


Exercise: 


Problem: Is this a right-tailed, left-tailed, or two-tailed test? 


Exercise: 


Problem: degrees of freedom = 


Solution: 


degrees of freedom = 3 


Exercise: 


Problem: $\chi^2$ test statistic =


Solution: 


2016.14 


Exercise: 


Problem: p-value = 


Solution: 


Rounded to 4 decimal places, the p-value is 0.0000. 
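
The degrees of freedom and test statistic in the solutions above can be checked with a short sketch. It is only an illustration: the variable names are hypothetical, and it assumes the counts and percentages from the two tables in this practice.

```python
# Observed AIDS cases by ethnicity and the county population shares,
# copied from the tables in this practice (names are illustrative only).
observed = {"White": 2229, "Hispanic": 1157,
            "Black/African-American": 457, "Asian, Pacific Islander": 232}
share = {"White": 0.429, "Hispanic": 0.267,
         "Black/African-American": 0.026, "Asian, Pacific Islander": 0.278}

total = sum(observed.values())                      # total number of cases
expected = {k: total * share[k] for k in observed}  # expected count per group

# Goodness-of-fit statistic: sum of (O - E)^2 / E over all categories.
chi2_stat = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1                              # number of categories - 1

print(df, round(chi2_stat, 2))
```

The statistic is far out in the right tail of a chi-square distribution with 3 degrees of freedom, which is consistent with a p-value that rounds to 0.0000.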
Exercise:

Problem:

Graph the situation. Label and scale the horizontal axis. Mark the
mean and test statistic. Shade in the region corresponding to the
p-value.


Let α = 0.05

Decision:

Reason for the Decision:

Conclusion (write out in complete sentences):


Discussion Question 


Exercise: 
Problem:

Does it appear that the pattern of AIDS cases in Santa Clara County
corresponds to the distribution of ethnic groups in this county? Why or
why not?


Practice 2: Contingency Tables 
This module provides a practice on Chi-Square Distribution as a part of Collaborative Statistics 
collection (col10522) by Barbara Illowsky and Susan Dean. 


Student Learning Outcomes 
• The student will conduct a test for independence using contingency tables.


Conduct a hypothesis test to determine if smoking level and ethnicity are independent. 


Collect the Data 


Copy the data provided in Probability Topics Practice 1: Contingency Tables into the table below. 


Smoking Level    African     Native                Japanese
Per Day          American    Hawaiian    Latino    Americans    White    TOTALS
1-10
11-20
21-30
31+
TOTALS


Smoking Levels by Ethnicity (Observed) 


Hypothesis 
State the hypotheses. 


• $H_o$:
• $H_a$:


Expected Values 


Enter the expected values in the table above. Round to two decimal places.


Analyze the Data 


Calculate the following values: 


Exercise: 


Problem: Degrees of freedom = 


Solution: 
12 


Exercise: 


Problem: $\chi^2$ test statistic =


Solution: 
10301.8 
Exercise: 
Problem: p-value = 


Solution: 
0 


Exercise: 


Problem: Is this a right-tailed, left-tailed, or two-tailed test? Explain why. 


Solution: 


right 


Graph the Data 


Exercise: 


Problem: 


Graph the situation. Label and scale the horizontal axis. Mark the mean and test statistic. Shade in 
the region corresponding to the p-value. 


Conclusions 


State the decision and conclusion (in a complete sentence) for the following preconceived levels of α.

Exercise:


Problem: α = 0.05

a. Decision:
b. Reason for the decision:
c. Conclusion (write out in a complete sentence):


Solution:


a. Reject the null hypothesis

Exercise:


Problem: α = 0.01

a. Decision:
b. Reason for the decision:
c. Conclusion (write out in a complete sentence):


Practice 3: Test of a Single Variance 
This module provides a practice on Chi-Square Distribution as a part of 
Elementary Statistics textbook. 


Student Learning Outcomes 


e The student will conduct a test of a single variance. 


Given 


Suppose an airline claims that its flights are consistently on time with an 
average delay of at most 15 minutes. It claims that the average delay is so 
consistent that the variance is no more than 150 minutes. Doubting the 
consistency part of the claim, a disgruntled traveler calculates the delays for 
his next 25 flights. The average delay for those 25 flights is 22 minutes with 
a standard deviation of 15 minutes. 


Sample Variance 


Exercise: 


Problem: 


Is the traveler disputing the claim about the average or about the 
variance? 


Exercise: 


Problem: 


A sample standard deviation of 15 minutes is the same as a sample 
variance of minutes. 


Solution: 


225


Exercise:


Problem: Is this a right-tailed, left-tailed, or two-tailed test?


Hypothesis Test


Perform a hypothesis test on the consistency part of the claim.

Exercise:


Problem: $H_o$:


Exercise:


Problem: $H_a$:


Exercise:


Problem: Degrees of freedom =


Solution:


24


Exercise: 


Problem: $\chi^2$ test statistic =


Solution: 


36 


Exercise: 


Problem: p-value =


Solution: 


0.0549 
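
The three solutions above can be reproduced with a small sketch. It assumes a right-tailed test (the traveler suspects the variance exceeds 150) and uses a closed-form series for the right-tail area that is valid only for an even number of degrees of freedom. The helper name `chi2_sf_even` is hypothetical.

```python
import math

def chi2_sf_even(x, df):
    """Right-tail area P(chi-square(df) > x); valid only when df is even."""
    assert df % 2 == 0
    half = x / 2
    term = math.exp(-half)        # first term of the series
    total = 0.0
    for i in range(df // 2):      # sum of e^(-x/2) * (x/2)^i / i!
        total += term
        term *= half / (i + 1)
    return total

n = 25              # flights sampled
s = 15              # sample standard deviation (minutes)
claimed_var = 150   # variance claimed by the airline

df = n - 1
chi2_stat = df * s ** 2 / claimed_var   # (n - 1) * s^2 / sigma^2
p_value = chi2_sf_even(chi2_stat, df)

print(df, chi2_stat, round(p_value, 4))
```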


Exercise: 


Problem: 


Graph the situation. Label and scale the horizontal axis. Mark the 
mean and test statistic. Shade the p-value. 


Exercise: 


Problem: Let α = 0.05
Decision: 


Conclusion (write out in a complete sentence): 


Discussion Questions 


Exercise: 


Problem: How did you know to test the variance instead of the mean? 
Exercise: 

Problem: 

If an additional test were done on the claim of the average delay, which 

distribution would you use? 


Exercise: 


Problem: 


If an additional test was done on the claim of the average delay, but 45 
flights were surveyed, which distribution would you use? 


Homework 

This module provides homework on Chi-Square Distribution as a part of Collaborative 
Statistics collection (col10522) by Barbara Illowsky and Susan Dean. 

Exercise: 


Problem: 


a. Explain why the “goodness of fit” test and the “test for independence” are
generally right-tailed tests.
b. If you did a left-tailed test, what would you be testing?


Word Problems 


For each word problem, use a solution sheet to solve the hypothesis test problem. Go to The 
Table of Contents 14. Appendix for the chi-square solution sheet. Round expected frequency 
to two decimal places. 

Exercise: 


Problem:

A 6-sided die is rolled 120 times. Fill in the expected frequency column. Then, conduct a
hypothesis test to determine if the die is fair. The data below are the result of the 120
rolls.


Face Value Frequency Expected Frequency 
1 15 
2 29 
3 16 
4 15 
5 30 
6 15 


Exercise: 


Problem: 


The marital status distribution of the U.S. male population, age 15 and older, is as shown
below. (Source: U.S. Census Bureau, Current Population Reports)


Marital Status        Percent    Expected Frequency
never married         31.3
married               56.1
widowed               2.5
divorced/separated    10.1


Suppose that a random sample of 400 U.S. young adult males, 18 - 24 years old, yielded
the following frequency distribution. We are interested in whether this age group of males
fits the distribution of the U.S. adult population. Calculate the frequency one would
expect when surveying 400 people. Fill in the above table, rounding to two decimal
places.


Marital Status        Frequency
never married         140
married
widowed
divorced/separated


Solution:


a. The data fits the distribution
b. The data does not fit the distribution
c. 3
e. 19.27
f. 0.0002
h. Decision: Reject null; Conclusion: Data does not fit the distribution.


The next two questions refer to the following information. The columns in the chart below 
contain the Race/Ethnicity of U.S. Public Schools for a recent year, the percentages for the 
Advanced Placement Examinee Population for that class and the Overall Student Population. 
(Source: http://www.collegeboard.com). Suppose the right column contains the result of a 
survey of 1000 local students from that year who took an AP Exam. 


                                     AP Examinee    Overall Student    Survey
Race/Ethnicity                       Population     Population         Frequency
Asian, Asian American or
Pacific Islander                     10.2%          5.4%               113
Black or African American            8.2%           14.5%              94
Hispanic or Latino                   15.5%          15.9%              136
American Indian or Alaska Native     0.6%           1.2%               10
White                                59.4%          61.6%              604
Not reported/other                   6.1%           1.4%               43

Exercise: 

Problem: 


Perform a goodness-of-fit test to determine whether the local results follow the 
distribution of the U. S. Overall Student Population based on ethnicity. 


Exercise: 


Problem: 


Perform a goodness-of-fit test to determine whether the local results follow the 
distribution of U. S. AP Examinee Population, based on ethnicity. 


Solution:


c. 5
e. 13.4
f. 0.0199
g. Decision: Reject null when α = 0.05; Conclusion: Local data do not fit the AP
Examinee Distribution. Decision: Do not reject null when α = 0.01; Conclusion:
There is insufficient evidence to conclude that local data do not fit the AP
Examinee Distribution.
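
A hedged sketch of this calculation follows. It assumes the table entries that are hard to read in this copy are 94 (Black or African American) and 10 with 0.6% (American Indian or Alaska Native), which makes the survey total 1000. The helper `chi2_sf` is a hypothetical name; it evaluates the right-tail area for integer degrees of freedom using only the standard library.

```python
import math

def chi2_sf(x, df):
    """Right-tail area P(chi-square(df) > x) for a positive integer df."""
    half = x / 2
    if df % 2 == 0:
        total, term = 0.0, math.exp(-half)
        for i in range(df // 2):              # e^(-x/2) * (x/2)^i / i!
            total += term
            term *= half / (i + 1)
        return total
    total = math.erfc(math.sqrt(half))        # df = 1 case
    term = math.exp(-half) * math.sqrt(half) / math.gamma(1.5)
    for i in range((df - 1) // 2):            # add one term per extra 2 df
        total += term
        term *= half / (i + 1.5)
    return total

# AP Examinee Population shares and survey frequencies, in table order.
ap_share = [0.102, 0.082, 0.155, 0.006, 0.594, 0.061]
observed = [113, 94, 136, 10, 604, 43]

n = sum(observed)
expected = [n * p for p in ap_share]
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi2_sf(chi2_stat, df)

print(df, round(chi2_stat, 1), round(p_value, 4))
```

Because the p-value falls between 0.01 and 0.05, the decision changes with the significance level, as the solution states.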


Exercise: 


Problem: 


The City of South Lake Tahoe, CA, has an Asian population of 1419 people, out of a total 
population of 23,609 (Source: U.S. Census Bureau). Suppose that a survey of 1419 self- 
reported Asians in Manhattan, NY, area yielded the data in the table below. Conduct a 
goodness of fit test to determine if the self-reported sub-groups of Asians in the 
Manhattan area fit that of the Lake Tahoe area. 


Race            Lake Tahoe Frequency    Manhattan Frequency
Asian Indian    131                     174
Chinese         118                     557
Filipino        1045                    518
Japanese        80                      54
Korean          12                      29
Vietnamese      9                       21
Other           24                      66


The next two questions refer to the following information: UCLA conducted a survey of 
more than 263,000 college freshmen from 385 colleges in fall 2005. The results of student 
expected majors by gender were reported in The Chronicle of Higher Education (2/2/2006). 
Suppose a survey of 5000 graduating females and 5000 graduating males was done as a 
follow-up last year to determine what their actual major was. The results are shown in the 
tables for Exercises 7 and 8. The second column in each table does not add to 100% because 
of rounding. 

Exercise: 


Problem: 


Conduct a hypothesis test to determine if the actual college major of graduating females 
fits the distribution of their expected majors. 


Major Women - Expected Major Women - Actual Major 
Arts & Humanities 14.0% 670 
Biological Sciences 8.4% 410 
Business 13.1% 685 
Education 13.0% 650 
Engineering 2.6% 145 
Physical Sciences 2.6% 125 
Professional 18.9% 975 
Social Sciences 13.0% 605 
Technical 0.4% 15 
Other 5.8% 300 
Undecided 8.0% 420 
Solution:


c. 10
e. 11.48
f. 0.3214
h. Decision: Do not reject null when α = 0.05 and α = 0.01; Conclusion: There is
insufficient evidence to conclude that the distribution of majors by graduating
females does not fit the distribution of expected majors.


Exercise: 
Problem: 


Conduct a hypothesis test to determine if the actual college major of graduating males fits 
the distribution of their expected majors. 


Major Men - Expected Major Men - Actual Major 
Arts & Humanities 11.0% 600 
Biological Sciences 6.7% 330 
Business 22.7% 1130 
Education 5.8% 305 
Engineering 15.6% 800 
Physical Sciences 3.6% 175 
Professional 9.3% 460 
Social Sciences 7.6% 370 
Technical 1.8% 90 
Other 8.2% 400 
Undecided 6.6% 340 


Exercise: 


Problem:

A recent debate about where in the United States skiers believe the skiing is best
prompted the following survey. Test to see if the best ski area is independent of the level
of the skier.


U.S. Ski Area Beginner Intermediate Advanced 
Tahoe 20 30 40 
Utah 10 30 60 


Colorado 10 40 50 


Solution:


c. 4
e. 10.53
f. 0.0324
h. Decision: Reject null; Conclusion: Best ski area and level of skier are not
independent.
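
The ski-area solution can be traced step by step with the following sketch, which builds the expected counts from row and column totals. The names are illustrative, and the p-value line uses a closed form that holds only for 4 degrees of freedom.

```python
import math

# Observed counts: rows are Tahoe, Utah, Colorado;
# columns are Beginner, Intermediate, Advanced.
observed = [
    [20, 30, 40],
    [10, 30, 60],
    [10, 40, 50],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: (row total)(column total) / total surveyed.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 4 only, the right-tail area is e^(-x/2) * (1 + x/2).
half = chi2_stat / 2
p_value = math.exp(-half) * (1 + half)

print(df, round(chi2_stat, 2), round(p_value, 4))
```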


Exercise:

Problem:

Car manufacturers are interested in whether there is a relationship between the size of car
an individual drives and the number of people in the driver’s family (that is, whether car
size and family size are independent). To test this, suppose that 800 car owners were
randomly surveyed with the following results. Conduct a test for independence.


Family Size    Sub & Compact    Mid-size    Full-size    Van & Truck
1              20               35          40           35
2              20               50          70           80
3-4            20               50          100          90
5+             20               30          70           70


Exercise:


Problem:


College students may be interested in whether or not their majors have any effect on
starting salaries after graduation. Suppose that 300 recent graduates were surveyed as to
their majors in college and their starting salaries after graduation. Below are the data.
Conduct a test for independence.


Major          < $50,000    $50,000 - $68,999    $69,000 +
English        5            20                   5
Engineering    10           30                   60
Nursing        10           15                   15
Business       10           20                   30
Psychology     20           30                   20


Solution:


c. 8
e. 33.55
f. 0
h. Decision: Reject null; Conclusion: Major and starting salary are not independent
events.


Exercise:


Problem:


Some travel agents claim that honeymoon hot spots vary according to age of the bride 
and groom. Suppose that 280 East Coast recent brides were interviewed as to where they 
spent their honeymoons. The information is given below. Conduct a test for 
independence. 


Location          20 - 29    30 - 39    40 - 49    50 and over
Niagara Falls     15         25         25         20
Poconos           15         25         25         10
Europe            10         25         15         5
Virgin Islands    20         25         15         5
Exercise: 
Problem: 


A manager of a sports club keeps information concerning the main sport in which 
members participate and their ages. To test whether there is a relationship between the 
age of a member and his or her choice of sport, 643 members of the sports club are 
randomly selected. Conduct a test for independence. 


Sport 18 - 25 26 - 30 31 - 40 41 and over 
racquetball 42 58 30 46 
tennis 58 76 38 65 
swimming 72 60 65 33 


Solution:


c. 6
e. 25.21
f. 0.0003
h. Decision: Reject null


Exercise: 


Problem: 


A major food manufacturer is concerned that the sales for its skinny French fries have 
been decreasing. As a part of a feasibility study, the company conducts research into the 
types of fries sold across the country to determine if the type of fries sold is independent 
of the area of the country. The results of the study are below. Conduct a test for 
independence. 


Type of Fries Northeast South Central West 
skinny fries 70 50 20 25 
curly fries 100 60 15 30 
steak fries 20 40 10 10 
Exercise: 
Problem: 


According to Dan Lenard, an independent insurance agent in the Buffalo, N.Y. area, the 
following is a breakdown of the amount of life insurance purchased by males in the 
following age groups. He is interested in whether the age of the male and the amount of 
life insurance purchased are independent events. Conduct a test for independence. 


Age of              <           $200,000 -    $401,001 -    $1,000,000
Males      None     $200,000    $400,000      $1,000,000    +
20 - 29    40       15          40            0             5
30 - 39    35       5           20            20            10
40 - 49    20       0           30            0             30
50 +       40       30          15            15            10

Solution:

c. 12
e. 125.74
f. 0
h. Decision: Reject null


Exercise: 
Problem: 
Suppose that 600 thirty-year-olds were surveyed to determine whether or not there is a
relationship between the level of education an individual has and salary. Conduct a test
for independence.


                     Not a high
                     school        High school    College     Masters or
Annual Salary        graduate      graduate       graduate    doctorate
< $30,000            15            25             10          5
$30,000 - $40,000    20            40             70          30
$40,000 - $50,000    10            20             40          55
$50,000 - $60,000    5             10             20          60
$60,000 +            0             5              10          150

Exercise: 
Problem: 


A Psychologist is interested in testing whether there is a difference in the distribution of 
personality types for business majors and social science majors. The results of the study 
are shown below. Conduct a Test of Homogeneity. Test at a 5% level of significance. 


                  Open    Conscientious    Extrovert    Agreeable    Neurotic
Business          41      52               46           61           58
Social Science    72      75               63           80           65

Solution:

c. 4
d. Chi-Square with df = 4
e. 3.01
f. p-value = 0.5568
h.
   ii. Do not reject the null hypothesis.
   iv. There is insufficient evidence to conclude that the distribution of personality
   types is different for business and social science majors.


Exercise: 
Problem: 
Do men and women select different breakfasts? The breakfast ordered by randomly
selected men and women at a popular breakfast place is shown below. Conduct a test of
homogeneity. Test at a 5% level of significance.


French Toast Pancakes Waffles Omelettes 
Men 47 35 28 53 


Women 65 59 55 60 


Solution:


c. 3
e. 4.01
f. p-value = 0.2601
h.
   ii. Do not reject the null hypothesis.
   iv. There is insufficient evidence to conclude that the distribution of breakfast
   ordered is different for men and women.
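
As an illustration of the homogeneity test on the breakfast data, the sketch below computes the expected counts the same way as in a test of independence but takes the degrees of freedom as the number of columns minus 1. The names are hypothetical, and the p-value formula shown is the closed form for 3 degrees of freedom only.

```python
import math

# Observed breakfast orders: French Toast, Pancakes, Waffles, Omelettes.
men = [47, 35, 28, 53]
women = [65, 59, 55, 60]

col_totals = [m + w for m, w in zip(men, women)]
grand_total = sum(col_totals)

chi2_stat = 0.0
for group in (men, women):
    row_total = sum(group)
    for o, c in zip(group, col_totals):
        e = row_total * c / grand_total   # (row total)(column total) / total
        chi2_stat += (o - e) ** 2 / e

df = len(men) - 1                         # number of columns - 1

# For df = 3 only: P(X > x) = erfc(sqrt(x/2)) + sqrt(2x/pi) * e^(-x/2).
p_value = (math.erfc(math.sqrt(chi2_stat / 2))
           + math.sqrt(2 * chi2_stat / math.pi) * math.exp(-chi2_stat / 2))

print(df, round(chi2_stat, 2), round(p_value, 4))
```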


Exercise: 


Problem: 


Is there a difference between the distribution of community college statistics students and 
the distribution of university statistics students in what technology they use on their 
homework? Of the randomly selected community college students 43 used a computer, 
102 used a calculator with built in statistics functions, and 65 used a table from the 
textbook. Of the randomly selected university students 28 used a computer, 33 used a 
calculator with built in statistics functions, and 40 used a table from the textbook. 
Conduct an appropriate hypothesis test using a 0.05 level of significance. 


Solution:


c. 2
e. 7.05
f. p-value = 0.0294
h.
   ii. Reject the null hypothesis.
   iv. There is sufficient evidence to conclude that the distribution of technology use
   for statistics homework is not the same for statistics students at community colleges
   and at universities.


Exercise: 


Problem: 


A fisherman is interested in whether the distribution of fish caught in Green Valley Lake 
is the same as the distribution of fish caught in Echo Lake. Of the 191 randomly selected 
fish caught in Green Valley Lake, 105 were rainbow trout, 27 were other trout, 35 were 
bass, and 24 were catfish. Of the 293 randomly selected fish caught in Echo Lake, 115 
were rainbow trout, 58 were other trout, 67 were bass, and 53 were catfish. Perform the 
hypothesis test at a 5% level of significance. 


Solution:


c. 3
d. Chi-Square with df = 3
e. 11.75
f. p-value = 0.0083
h.
   ii. Reject the null hypothesis.
   iv. There is sufficient evidence to conclude that the distribution of fish in Green
   Valley Lake is not the same as the distribution of fish in Echo Lake.


Exercise: 


Problem: 


A plant manager is concerned her equipment may need recalibrating. It seems that the
actual weight of the 15 oz. cereal boxes it fills has been fluctuating. The standard
deviation should be at most 1/2 oz. In order to determine if the machine needs to be
recalibrated, 84 randomly selected boxes of cereal from the next day’s production were
weighed. The standard deviation of the 84 boxes was 0.54. Does the machine need to be
recalibrated?


Solution:


c. 83
d. Chi-Square with df = 83
e. 96.81
f. p-value = 0.1426; There is a 0.1426 probability that the sample standard deviation
is 0.54 or more.
h. Decision: Do not reject null; Conclusion: There is insufficient evidence to
conclude that the standard deviation is more than 0.5 oz. It cannot be determined
whether the equipment needs to be recalibrated or not.


Exercise: 


Problem: 


Consumers may be interested in whether the cost of a particular calculator varies from
store to store. Based on surveying 43 stores, which yielded a sample mean of $84 and a
sample standard deviation of $12, test the claim that the standard deviation is greater than
$15.


Exercise: 


Problem: 


Isabella, an accomplished Bay to Breakers runner, claims that the standard deviation for
her time to run the 7 1/2 mile race is at most 3 minutes. To test her claim, Rupinder looks
up 5 of her race times. They are 55 minutes, 61 minutes, 58 minutes, 63 minutes, and 57
minutes.


Solution:


c. 4
d. Chi-Square with df = 4
e. 4.52
f. 0.3402
h. Decision: Do not reject null.


Exercise: 


Problem: 


Airline companies are interested in the consistency of the number of babies on each 
flight, so that they have adequate safety equipment. They are also interested in the 
variation of the number of babies. Suppose that an airline executive believes the average 
number of babies on flights is 6 with a variance of 9 at most. The airline conducts a 
survey. The results of the 18 flights surveyed give a sample average of 6.4 with a sample 
standard deviation of 3.9. Conduct a hypothesis test of the airline executive’s belief. 


Exercise: 


Problem: 


The number of births per woman in China is 1.6, down from 5.91 in 1966 (Source: World
Bank, 6/5/12). This fertility rate has been attributed to the law passed in 1979 restricting
births to one per woman. Suppose that a group of students studied whether or not the 
standard deviation of births per woman was greater than 0.75. They asked 50 women 
across China the number of births they had. Below are the results. Does the students’ 
survey indicate that the standard deviation is greater than 0.75? 


# of births    Frequency
0              5
1              30
2              10
3              5

Solution:

c. 49
d. Chi-Square with df = 49
e. 54.37
f. p-value = 0.2774; If the null hypothesis is true, there is a 0.2774 probability that the
sample standard deviation is 0.79 or more.
h. Decision: Do not reject null; Conclusion: There is insufficient evidence to
conclude that the standard deviation is more than 0.75. It cannot be determined if the
standard deviation is greater than 0.75 or not.


Exercise: 


Problem: 


According to an avid aquarist, the average number of fish in a 20-gallon tank is 10, with
a standard deviation of 2. His friend, also an aquarist, does not believe that the standard
deviation is 2. She counts the number of fish in 15 other 20-gallon tanks. Based on the
results that follow, do you think that the standard deviation is different from 2? Data: 11;
10; 9; 10; 10; 11; 11; 10; 12; 9; 7; 9; 11; 10; 11

Exercise: 


Problem: 


The manager of "Frenchies" is concerned that patrons are not consistently receiving the
same amount of French fries with each order. The chef claims that the standard deviation
for a 10-ounce order of fries is at most 1.5 oz., but the manager thinks that it may be
higher. He randomly weighs 49 orders of fries, which yields a mean of 11 oz. and a
standard deviation of 2 oz.


Solution:

a. $\sigma^2 \le (1.5)^2$
c. 48
d. Chi-Square with df = 48
e. 85.33
f. 0.0007
h. Decision: Reject null.


Try these true/false questions. 


Exercise: 


Problem: 


As the degrees of freedom increase, the graph of the chi-square distribution looks more 
and more symmetrical. 


Solution: 
True 
Exercise: 
Problem: The standard deviation of the chi-square distribution is twice the mean. 
Solution: 


False 
Exercise: 


Problem: 
The mean and the median of the chi-square distribution are the same if df = 24. 
Solution: 


False 
Exercise: 


Problem: 


In a Goodness-of-Fit test, the expected values are the values we would expect if the null 
hypothesis were true. 


Solution: 


True 


Exercise: 


Problem: 


In general, if the observed values and expected values of a Goodness-of-Fit test are not 
close together, then the test statistic can get very large and on a graph will be way out in 
the right tail. 


Solution: 


True 
Exercise: 


Problem: 
The degrees of freedom for a Test for Independence are equal to the sample size minus 1. 
Solution: 


False 
Exercise: 


Problem: 


Use a Goodness-of-Fit test to determine if high school principals believe that students are 
absent equally during the week or not. 


Solution: 
True 


Exercise: 


Problem: The Test for Independence uses tables of observed and expected data values. 
Solution: 


True 
Exercise: 


Problem: 


The test to use when determining if the college or university a student chooses to attend 
is related to his/her socioeconomic status is a Test for Independence. 


Solution: 


True 


Exercise: 


Problem: The test to use to determine if a six-sided die is fair is a Goodness-of-Fit test. 
Solution: 


True 
Exercise: 


Problem: 


In a Test of Independence, the expected number is equal to the row total multiplied by the 
column total divided by the total surveyed. 


Solution: 


True 
Exercise: 


Problem: 


In a Goodness-of-Fit test, if the p-value is 0.0113, in general, do not reject the null
hypothesis.


Solution: 


False 
Exercise: 


Problem: 


For a Chi-Square distribution with degrees of freedom of 17, the probability that a value 
is greater than 20 is 0.7258. 


Solution: 


False 
Exercise: 


Problem: 
If df = 2, the chi-square distribution has a shape that reminds us of the exponential. 
Solution: 


True 


Review 

This module provides a review on Chi-Square Distribution as a part of
Collaborative Statistics collection (col10522) by Barbara Illowsky and 
Susan Dean. 


The next two questions refer to the following real study: 


A recent survey of U.S. teenage pregnancy was answered by 720 girls, age 
12 - 19. 6% of the girls surveyed said they have been pregnant. (Parade 
Magazine) We are interested in the true proportion of U.S. girls, age 12 - 
19, who have been pregnant. 

Exercise: 


Problem: 


Find the 95% confidence interval for the true proportion of U.S. girls, 
age 12 - 19, who have been pregnant. 


Solution: 


(0.0424,0.0770) 

Exercise: 
Problem: 
The report also stated that the results of the survey are accurate to
within ± 3.7% at the 95% confidence level. Suppose that a new study
is to be done. It is desired to be accurate to within 2% of the 95%
confidence level. What is the minimum number that should be
surveyed?


Solution: 


2401 
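
A worked version of this sample-size calculation is sketched below. It assumes the conservative choice $p' = q' = 0.5$ (the problem does not say which estimate to use) and $z_{0.025} = 1.96$, with $EBP$ denoting the desired error bound of 0.02.

\[
n = \frac{z_{0.025}^2 \, p' q'}{EBP^2} = \frac{(1.96)^2 (0.5)(0.5)}{(0.02)^2} = \frac{0.9604}{0.0004} = 2401
\]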
Exercise: 


Problem: 


Given: ~ Exp 3 . Sketch the graph that depicts: ( > 1). 


The next four questions refer to the following information: 


Suppose that the time that owners keep their cars (purchased new) is 
normally distributed with a mean of 7 years and a standard deviation of 2 
years. We are interested in how long an individual keeps his car (purchased 
new). Our population is people who buy their cars new. 

Exercise: 


Problem: 
60% of individuals keep their cars at most how many years? 
Solution: 


7.5
Exercise: 


Problem: 


Suppose that we randomly survey one person. Find the probability that 
person keeps his/her car less than 2.5 years. 


Solution: 


0.0122 
Exercise: 


Problem: 


If we are to pick individuals 10 at a time, find the distribution for the
mean length of car ownership.


Solution:


$\bar{X} \sim N(7, 0.63)$


Exercise: 


Problem: 


If we are to pick 10 individuals, find the probability that the sum of 
their ownership time is more than 55 years. 


Solution: 


0.9911
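
The last two solutions follow from the sampling distributions of the mean and of the sum for n = 10. The sketch below assumes the usual notation, with the second parameter being a standard deviation, and rounds the z-score to two decimal places.

\[
\bar{X} \sim N\left(7, \frac{2}{\sqrt{10}}\right) \approx N(7, 0.63), \qquad
\Sigma X \sim N\left(10 \cdot 7, \sqrt{10} \cdot 2\right) \approx N(70, 6.32)
\]

\[
P(\Sigma X > 55) = P\left(Z > \frac{55 - 70}{6.32}\right) \approx P(Z > -2.37) \approx 0.9911
\]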


Exercise: 


Problem: For which distribution is the median not equal to the mean? 


A. Uniform
B. Exponential
C. Normal
D. Student-t


Solution: 


B 
Exercise: 


Problem: 


Compare the standard normal distribution to the student-t distribution, 
centered at 0. Explain which of the following are true and which are 
false. 


a. As the number surveyed increases, the area to the left of -1 for
the student-t distribution approaches the area for the standard
normal distribution.
b. As the degrees of freedom decrease, the graph of the student-t
distribution looks more like the graph of the standard normal
distribution.
c. If the number surveyed is 15, the normal distribution should
never be used.


Solution:


a. True
b. False
c. False


The next five questions refer to the following information: 


We are interested in the checking account balance of a twenty-year-old 
college student. We randomly survey 16 twenty-year-old college students. 
We obtain a sample mean of $640 and a sample standard deviation of $150. 
Let X = checking account balance of an individual twenty-year-old college
student.

Exercise:


Problem: Explain why we cannot determine the distribution of X.
Exercise: 

Problem: 

If you were to create a confidence interval or perform a hypothesis test 


for the population mean checking account balance of 20-year old 
college students, what distribution would you use? 


Solution: 


student-t with df = 15 
Exercise: 


Problem: 


Find the 95% confidence interval for the true mean checking account 
balance of a twenty-year-old college student. 


Solution: 


(560.07,719.93) 
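
The interval can be reconstructed as follows, assuming the critical value $t_{0.025,\,15} \approx 2.1314$ from a Student-t table with 15 degrees of freedom.

\[
\bar{x} \pm t_{0.025,\,15} \cdot \frac{s}{\sqrt{n}} = 640 \pm 2.1314 \cdot \frac{150}{\sqrt{16}} \approx 640 \pm 79.93
\]

which gives approximately (560.07, 719.93).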
Exercise: 
Problem: 


What type of data is the balance of the checking account considered to 
be? 


Solution: 


quantitative - continuous 
Exercise: 


Problem: 


What type of data is the number of 20 year olds considered to be? 


Solution: 


quantitative - discrete 
Exercise: 
Problem: 
On average, a busy emergency room gets a patient with a shotgun 


wound about once per week. We are interested in the number of 
patients with a shotgun wound the emergency room gets per 28 days. 


a. Define the random variable X.
b. State the distribution for X.
c. Find the probability that the emergency room gets no patients
with shotgun wounds in the next 28 days.


Solution:


b. P(4)
c. 0.0183


The next two questions refer to the following information: 


The probability that a certain slot machine will pay back money when a 
quarter is inserted is 0.30 . Assume that each play of the slot machine is 
independent from each other. A person puts in 15 quarters for 15 plays. 
Exercise: 


Problem: 


Is the expected number of plays of the slot machine that will pay back 
money greater than, less than or the same as the median? Explain your 
answer. 


Solution: 


greater than 
Exercise: 


Problem: 


Is it likely that exactly 8 of the 15 plays would pay back money? 
Justify your answer numerically. 


Solution: 


No; P(X = 8) = 0.0348


Exercise: 


Problem: A game is played with the following rules: 


• it costs $10 to enter
• a fair coin is tossed 4 times
• if you do not get 4 heads or 4 tails, you lose your $10
• if you get 4 heads or 4 tails, you get back your $10, plus $30 more


Over the long run of playing this game, what are your expected 
earnings? 


Solution: 


You will lose $5 
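
The expected value behind this answer can be written out as follows, where $X$ (a symbol introduced here for illustration) is the net gain in dollars from one play.

\[
P(\text{4 heads or 4 tails}) = 2 \left(\frac{1}{2}\right)^4 = \frac{1}{8}
\]

\[
E(X) = 30 \cdot \frac{1}{8} + (-10) \cdot \frac{7}{8} = 3.75 - 8.75 = -5
\]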
Exercise: 
Problem: 
• The mean grade on a math exam in Rachel’s class was 74, with a
standard deviation of 5. Rachel earned an 80.
• The mean grade on a math exam in Becca’s class was 47, with a
standard deviation of 2. Becca earned a 51.
• The mean grade on a math exam in Matt’s class was 70, with a
standard deviation of 8. Matt earned an 83.


Find whose score was the best, compared to his or her own class. 
Justify your answer numerically. 
Solution: 


Becca 
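
The numerical justification is a comparison of z-scores, each computed as (score - class mean) / (class standard deviation):

\[
z_{\text{Rachel}} = \frac{80 - 74}{5} = 1.2, \qquad
z_{\text{Becca}} = \frac{51 - 47}{2} = 2, \qquad
z_{\text{Matt}} = \frac{83 - 70}{8} = 1.625
\]

Becca's score is the most standard deviations above her class mean.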


The next two questions refer to the following information: 


A random sample of 70 compulsive gamblers were asked the number of 
days they go to casinos per week. The results are given in the following 
graph: 


[Graph: relative frequency (vertical axis, marked 0.1, 0.2, 0.3) by number of days (horizontal axis, 1 through 7)]

Exercise: 


Problem: Find the number of responses that were “5”.


Solution: 


14 
Exercise: 
Problem: 


Find the mean, standard deviation, the median, the first quartile, the 
third quartile and the IQR. 


Solution: 


• Sample mean = 3.2
• Sample standard deviation = 1.85
• Median = 3
• Quartile 1 = 2
• Quartile 3 = 5
• IQR = 3


Exercise: 


Problem: 


Based upon research at De Anza College, it is believed that about 19% 
of the student population speaks a language other than English at 
home. 


Suppose that a study was done this year to see if that percent has 
decreased. Ninety-eight students were randomly surveyed with the 
following results. Fourteen said that they speak a language other than 
English at home. 


• a. State an appropriate null hypothesis.

• b. State an appropriate alternate hypothesis.

• c. Define the Random Variable, P'.

• d. Calculate the test statistic.

• e. Calculate the p-value.

• f. At the 5% level of significance, what is your decision about the null
hypothesis?

• g. What is the Type I error?

• h. What is the Type II error?


Solution: 


• d. z = -1.19
• e. 0.1171
• f. Do not reject the null
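
The test statistic and p-value in parts d and e can be reproduced with a one-proportion z-test. This is a sketch that assumes the usual normal approximation and a left-tailed alternative (the study asks whether the percent has decreased); `normal_cdf` is a helper written here for illustration.

```python
from math import sqrt, erf

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

p0 = 0.19          # hypothesized population proportion
n = 98             # number of students surveyed
p_prime = 14 / n   # sample proportion

# Standard error uses the hypothesized proportion p0.
z = (p_prime - p0) / sqrt(p0 * (1 - p0) / n)
p_value = normal_cdf(z)   # left-tailed: area below z

print(round(z, 2), round(p_value, 4))
```

Because the p-value is larger than 0.05, the decision is to not reject the null hypothesis.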


Exercise: 


Problem: 


Assume that you are an emergency paramedic called in to rescue 
victims of an accident. You need to help a patient who is bleeding 
profusely. The patient is also considered to be a high risk for 
contracting AIDS. Assume that the null hypothesis is that the patient 
does not have the HIV virus. What is a Type I error? 


Solution: 


We conclude that the patient does have the HIV virus when, in fact, the 
patient does not. 


Exercise: 


Problem: 


It is often said that Californians are more casual than the rest of 
Americans. Suppose that a survey was done to see if the proportion of 
Californian professionals that wear jeans to work is greater than the 
proportion of non-Californian professionals. Fifty of each was 
surveyed with the following results. 15 Californians wear jeans to 
work and 6 non-Californians wear jeans to work. 


• C = Californian professional
• NC = non-Californian professional


• a. State appropriate null and alternate hypotheses.

• b. Define the Random Variable.

• c. Calculate the test statistic and p-value.

• d. At the 5% significance level, what is your decision?

• e. What is the Type I error?

• f. What is the Type II error?


Solution: 


• c. z = 2.21; p = 0.0136
• d. Reject the null

• e. We conclude that the proportion of Californian professionals that
wear jeans to work is greater than the proportion of non-Californian
professionals when, in fact, it is not greater.

• f. We cannot conclude that the proportion of Californian
professionals that wear jeans to work is greater than the
proportion of non-Californian professionals when, in fact, it is
greater.
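
The test statistic and p-value in part c can be checked with a two-proportion z-test. The sketch below assumes the usual pooled-proportion standard error and a right-tailed alternative (Californian proportion greater); the helper `normal_cdf` and the variable names are illustrative.

```python
from math import sqrt, erf

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n_c, x_c = 50, 15      # Californians surveyed, number wearing jeans
n_nc, x_nc = 50, 6     # non-Californians surveyed, number wearing jeans

p_c = x_c / n_c
p_nc = x_nc / n_nc
p_pooled = (x_c + x_nc) / (n_c + n_nc)

# Pooled standard error of the difference in sample proportions.
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_c + 1 / n_nc))
z = (p_c - p_nc) / se
p_value = 1 - normal_cdf(z)   # right-tailed: area above z

print(round(z, 2), round(p_value, 4))
```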


The next two questions refer to the following information: 


A group of Statistics students have developed a technique that they feel will 
lower their anxiety level on statistics exams. They measured their anxiety 
level at the start of the quarter and again at the end of the quarter. Recorded 
is the paired data in that order: (1000, 900); (1200, 1050); (600, 700); 
(1300, 1100); (1000, 900); (900, 900). 

Exercise: 


Problem: This is a test of (pick the best answer): 


• A. large samples, independent means
• B. small samples, independent means
• C. dependent means


Solution: 


C 


Exercise: 


Problem: State the distribution to use for the test. 


Solution: 


Chi Squared Distribution Lab 
This lab combines both the Chi-Squared goodness-of-fit test and the test for
independence. It also incorporates the use of Minitab software.


Chi-Squared Hypothesis Tests


Name: 


Student Learning Outcomes: 


• The student will select the appropriate Chi-Squared Hypothesis Tests.
• The student will conduct hypothesis tests and interpret the results.


Candy Lab Revisited 


Suppose the color distribution for the candy from your initial Candy Lab is 
uniform. Based on the class data, does it appear that the distribution is 
actually uniform? (α = 0.08)


1. H₀:            Hₐ:

2. Which χ² distribution test will be used?

3. Using Minitab, draw the χ² probability graph and label it appropriately.
Attach the graph to this lab.

4. What is the formula for calculating the test statistic? 

5. Calculate the test statistic and the p-value using Minitab. Include the 
session window output. 


test statistic = p-value = 


Do you reject or fail to reject the null hypothesis? Why? (Use a complete 
sentence.) 


Write a clear conclusion using a complete sentence. 


Favorite Snacks 


Is the choice of snack food dependent on gender? Survey sufficient 
numbers of fellow classmates and others to ensure that the conditions for 
this χ² test are met. (α = 0.05)


Final Count:

          Sweets    Chips and Pretzels    Fruit and Vegetables    Total
Male
Female
Total


1. H₀:            Hₐ:

2. Which χ² distribution test will be used?

3. Using Minitab, draw the χ² probability graph and label it appropriately.
Attach the graph to this lab.

4. What is the formula for calculating the test statistic? 

5. Calculate the test statistic and the p-value using Minitab. Include the 
session window output. 


test statistic = p-value = 


Do you reject or fail to reject the null hypothesis? Why? (Use a complete 
sentence.) 


Write a clear conclusion using a complete sentence. 


Linear Regression and Correlation 

This module provides an introduction of Linear Regression and Correlation 
as a part of Collaborative Statistics collection (col10522) by Barbara 
Illowsky and Susan Dean. 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Discuss basic ideas of linear regression and correlation.
• Create and interpret a line of best fit.
• Calculate and interpret the correlation coefficient.
• Calculate and interpret outliers.


Introduction 


Professionals often want to know how two or more numeric variables are 
related. For example, is there a relationship between the grade on the 
second math exam a student takes and the grade on the final exam? If there 
is a relationship, what is it and how strong is the relationship? 


In another example, your income may be determined by your education, 
your profession, your years of experience, and your ability. The amount you 
pay a repair person for labor is often determined by an initial amount plus 
an hourly fee. These are all examples in which regression can be used. 


The type of data described in the examples is bivariate data - "bi" for two 
variables. In reality, statisticians use multivariate data, meaning many 
variables. 


In this chapter, you will be studying the simplest form of regression, "linear
regression" with one independent variable (x). This involves data that fits a
line in two dimensions. You will also study correlation, which measures how
strong the relationship is.


Linear Regression and Correlation: Linear Equations 


Linear regression for two variables is based on a linear equation with one 
independent variable. It has the form: 
Equation: 


y=a+bx 


where a and b are constant numbers. 


x is the independent variable, and y is the dependent variable. 
Typically, you choose a value to substitute for the independent variable and 
then solve for the dependent variable. 


Example: 
The following examples are linear equations. 
Equation: 

y = 3 + 2x
Equation: 


y = -0.01 + 1.2x


The graph of a linear equation of the form y = a + bx is a straight line. 
Any line that is not vertical can be described by this equation. 


Example: 


[Figure: Graph of the equation y = -1 + 2x.]


Linear equations of this form occur in applications of life sciences, social
sciences, psychology, business, economics, physical sciences, mathematics,
and other areas.


Example: 
Aaron's Word Processing Service (AWPS) does word processing. Its rate is
$32 per hour plus a $31.50 one-time charge. The total cost to a customer
depends on the number of hours it takes to do the word processing job.
Exercise: 


Problem: 


Find the equation that expresses the total cost in terms of the number 
of hours required to finish the word processing job. 


Solution: 


Let x = the number of hours it takes to get the job done. 
Let y = the total cost to the customer. 


The $31.50 is a fixed cost. If it takes x hours to complete the job, then
(32)(x) is the cost of the word processing only. The total cost is:


y = 31.50 + 32x 
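
As a quick illustration, the equation can be written as a small function and evaluated for a few job lengths. The function name `total_cost` is hypothetical, chosen only for this sketch.

```python
def total_cost(hours):
    """Total cost in dollars for a job taking `hours` hours: y = 31.50 + 32x."""
    return 31.50 + 32 * hours

print(total_cost(0))   # 31.5  (the one-time charge alone)
print(total_cost(3))   # 127.5
```

Each additional hour adds $32 to the total, which is the slope of the line; the $31.50 charge is the y-intercept.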


Linear Regression and Correlation: Slope and Y-Intercept of a Linear 
Equation 


For the linear equation y = a + bx, b = slope and a = y-intercept.


From algebra recall that the slope is a number that describes the steepness 
of a line and the y-intercept is the y coordinate of the point (0, a) where the 
line crosses the y-axis. 


• If b > 0, the line slopes upward to the right.
• If b = 0, the line is horizontal.
• If b < 0, the line slopes downward to the right.


[Figure: Three possible graphs of y = a + bx.]


Example: 

Svetlana tutors to make extra money for college. For each tutoring session, 
she charges a one time fee of $25 plus $15 per hour of tutoring. A linear 
equation that expresses the total amount of money Svetlana earns for each 
session she tutors is y = 25 + 15x. 

Exercise: 


Problem: 
What are the independent and dependent variables? What is the
y-intercept and what is the slope? Interpret them using complete
sentences.


Solution: 


The independent variable (x) is the number of hours Svetlana tutors 
each session. The dependent variable (y) is the amount, in dollars, 
Svetlana earns for each session. 


The y-intercept is 25 (a = 25). At the start of the tutoring session, 
Svetlana charges a one-time fee of $25 (this is when x = 0). The slope 
is 15 (b = 15). For each session, Svetlana earns $15 for each hour she 
tutors. 


Scatter Plots 

This module provides an overview of Linear Regression and Correlation: 
Scatter Plots as a part of Collaborative Statistics collection (col10522) by 
Barbara Illowsky and Susan Dean. 


Before we take up the discussion of linear regression and correlation, we 
need to examine a way to display the relation between two variables x and 
y. The most common and easiest way is a scatter plot. The following 
example illustrates a scatter plot. 


Example: 

From an article in the Wall Street Journal: In Europe and Asia, m- 
commerce is popular. M-commerce users have special mobile phones that 
work like electronic wallets as well as provide phone and Internet services. 
Users can do everything from paying for parking to buying a TV set or 
soda from a machine to banking to checking sports scores on the Internet. 
For the years 2000 through 2004, was there a relationship between the year 
and the number of m-commerce users? Construct a scatter plot. Let x = the 
year and let y = the number of m-commerce users, in millions. 


Table showing the number of m-commerce users (in millions) by year.

x (year)    y (# of users)
2000        0.5
2002        20.0
2003        33.0
2004        47.0


[Figure: Scatter plot showing the number of m-commerce users (in millions) by year. Horizontal axis: x = year.]


A scatter plot shows the direction and strength of a relationship between 
the variables. A clear direction happens when there is either: 


• High values of one variable occurring with high values of the other
variable or low values of one variable occurring with low values of the
other variable.

• High values of one variable occurring with low values of the other
variable.


You can determine the strength of the relationship by looking at the scatter 
plot and seeing how close the points are to a line, a power function, an 
exponential function, or to some other type of function. 


When you look at a scatterplot, you want to notice the overall pattern and 
any deviations from the pattern. The following scatterplot examples 
illustrate these concepts. 

[Figures: four example scatter plots titled Positive Linear Pattern (Strong), Linear Pattern w/ One Deviation, Exponential Growth Pattern, and No Pattern.]


In this chapter, we are interested in scatter plots that show a linear pattern.
Linear patterns are quite common. The linear relationship is strong if the
points are close to a straight line. If we think that the points show a linear
relationship, we would like to draw a line on the scatter plot. This line can
be calculated through a process called linear regression. However, we only
calculate a regression line if one of the variables helps to explain or predict
the other variable. If x is the independent variable and y the dependent
variable, then we can use a regression line to predict y for a given value of
x.


The Regression Equation 

Linear Regression and Correlation: The Regression Equation is a part of 
Collaborative Statistics collection (col10522) by Barbara Illowsky and 
Susan Dean. Contributions from Roberta Bloom include instructions for 
finding and graphing the regression equation and scatterplot using the 
LinRegTTest on the TI-83,83+,84+ calculators. 


Data rarely fit a straight line exactly. Usually, you must be satisfied with
rough predictions. Typically, you have a set of data whose scatter plot
appears to "fit" a straight line. That line is called a Line of Best Fit or
Least Squares Line.


Optional Collaborative Classroom Activity 


If you know a person's pinky (smallest) finger length, do you think you 
could predict that person's height? Collect data from your class (pinky 
finger length, in inches). The independent variable, x, is pinky finger length
and the dependent variable, y, is height.


For each set of data, plot the points on graph paper. Make your graph big 
enough and use a ruler. Then "by eye" draw a line that appears to "fit" the 
data. For your line, pick two convenient points and use them to find the 
slope of the line. Find the y-intercept of the line by extending your lines so 
they cross the y-axis. Using the slopes and the y-intercepts, write your 
equation of "best fit". Do you think everyone will have the same equation? 
Why or why not? 


Using your equation, what is the predicted height for a pinky length of 2.5 
inches? 


Example: 

A random sample of 11 statistics students produced the following data 
where x is the third exam score, out of 80, and y is the final exam score,
out of 200. Can you predict the final exam score of a random student if you 
know the third exam score? 


Table showing the scores on the final exam based on scores from the third exam.

x (third exam score)    y (final exam score)
65                      175
67                      133
71                      185
71                      163
66                      126
75                      198
67                      153
70                      163
71                      159
69                      151
69                      159


[Figure: Scatter plot showing the scores on the final exam based on scores from the third exam. Horizontal axis: Third Exam Score; vertical axis: Final Exam Score.]


The third exam score, x, is the independent variable and the final exam
score, y, is the dependent variable. We will plot a regression line that best
"fits" the data. If each of you were to fit a line "by eye", you would draw
different lines. We can use what is called a least-squares regression line to
obtain the best fit line.


Consider the following diagram. Each point of data is of the form (x, y)
and each point of the line of best fit using least-squares linear regression has
the form (x, ŷ).


The ŷ is read "y hat" and is the estimated value of y. It is the value of y
obtained using the regression line. It is not generally equal to y from data.


[Figure: a data point (x₀, y₀) above the line of best fit and the point (x₀, ŷ₀) on the line directly below it. The vertical distance between them is |y₀ - ŷ₀| = |ε₀|.]


The term y₀ - ŷ₀ = ε₀ is called the "error" or residual. It is not an error
in the sense of a mistake. The absolute value of a residual measures the
vertical distance between the actual value of y and the estimated value of y.
In other words, it measures the vertical distance between the actual data
point and the predicted point on the line.


If the observed data point lies above the line, the residual is positive, and 
the line underestimates the actual data value for y. If the observed data 
point lies below the line, the residual is negative, and the line overestimates 
that actual data value for y. 


In the diagram above, y₀ - ŷ₀ = ε₀ is the residual for the point shown. Here
the point lies above the line and the residual is positive.


ε = the Greek letter epsilon


For each data point, you can calculate the residuals or errors, yᵢ - ŷᵢ = εᵢ
for i = 1, 2, 3, ..., 11.


Each |ε| is a vertical distance.


For the example about the third exam scores and the final exam scores for
the 11 statistics students, there are 11 data points. Therefore, there are 11 ε
values. If you square each ε and add, you get


$(\varepsilon_1)^2 + (\varepsilon_2)^2 + \dots + (\varepsilon_{11})^2 = \sum_{i=1}^{11} \varepsilon_i^2$


This is called the Sum of Squared Errors (SSE).
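
To make the definition concrete, here is a minimal sketch that computes residuals and the SSE for a candidate line y = a + bx. The three data points and the candidate line are made up for illustration; they are not the exam data.

```python
def residuals(a, b, xs, ys):
    """Residual (observed y minus predicted y) for each data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def sse(a, b, xs, ys):
    """Sum of Squared Errors for the line y = a + b*x."""
    return sum(e ** 2 for e in residuals(a, b, xs, ys))

# Toy data, invented for this sketch.
xs = [1, 2, 3]
ys = [2, 4, 5]

print(residuals(1, 1.5, xs, ys))   # [-0.5, 0.0, -0.5]
print(sse(1, 1.5, xs, ys))         # 0.5
```

A negative residual means the point lies below the candidate line; a residual of zero means the point lies exactly on it.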


Using calculus, you can determine the values of a and b that make the SSE 
a minimum. When you make the SSE a minimum, you have determined the 
points that are on the line of best fit. It turns out that the line of best fit has 
the equation: 


Equation: 
ŷ = a + bx
where $a = \bar{y} - b\,\bar{x}$ and $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$


x̄ and ȳ are the sample means of the x values and the y values, respectively.
The best fit line always passes through the point (x̄, ȳ).


The slope b can be written as $b = r \left( \frac{s_y}{s_x} \right)$ where s_y = the standard
deviation of the y values and s_x = the standard deviation of the x values. r
is the correlation coefficient, which is discussed in the next section.
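
The formulas for a and b can be applied directly. The sketch below uses the 11 (x, y) pairs of the third exam/final exam example as read from the table earlier in this section; the variable names and the comparison lines are illustrative.

```python
# Third exam scores (x) and final exam scores (y) for the 11 students.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope and intercept of the least-squares line.
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

def sse(a0, b0):
    """Sum of Squared Errors for the line y = a0 + b0*x on the data above."""
    return sum((yi - (a0 + b0 * xi)) ** 2 for xi, yi in zip(x, y))

best = sse(a, b)
print(round(a, 2), round(b, 2))
# Moving the intercept or the slope away from the fitted values raises the SSE.
print(best < sse(a + 1, b), best < sse(a, b + 0.1))
```

The two comparisons at the end illustrate the least-squares property: any other line gives a larger SSE than the fitted one.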


Least Squares Criteria for Best Fit 

The process of fitting the best fit line is called linear regression. The idea
behind finding the best fit line is based on the assumption that the data are
scattered about a straight line. The criterion for the best fit line is that the
sum of the squared errors (SSE) is minimized, that is, made as small as
possible. Any other line you might choose would have a higher SSE than
the best fit line. This best fit line is called the least squares regression line.


Note: Computer spreadsheets, statistical software, and many calculators
can quickly calculate the best fit line and create the graphs. The 
calculations tend to be tedious if done by hand. Instructions to use the TI- 
83, TI-83+, and TI-84+ calculators to find the best fit line and create a 
scatterplot are shown at the end of this section. 


THIRD EXAM vs FINAL EXAM EXAMPLE: 
The graph of the line of best fit for the third exam/final exam example is
shown below:
[Figure: scatter plot of the data with the line of best fit. Horizontal axis: Third Exam Score.]


The least squares regression line (best fit line) for the third exam/final exam 
example has the equation: 
Equation: 


ŷ = -173.51 + 4.83x


Note: 


• Remember, it is always important to plot a scatter diagram first. If the
scatter plot indicates that there is a linear relationship between the
variables, then it is reasonable to use a best fit line to make
predictions for y given x within the domain of x-values in the sample
data, but not necessarily for x-values outside that domain.

• You could use the line to predict the final exam score for a student
who earned a grade of 73 on the third exam.

• You should NOT use the line to predict the final exam score for a
student who earned a grade of 50 on the third exam, because 50 is not
within the domain of the x-values in the sample data, which are
between 65 and 75.
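
The prediction rule described in this note can be sketched as a function that refuses to extrapolate. The function name `predict_final` and the choice to raise an error outside the observed domain are illustrative assumptions, not part of the text.

```python
X_MIN, X_MAX = 65, 75   # domain of third exam scores in the sample data

def predict_final(third_exam):
    """Predicted final exam score from the line y-hat = -173.51 + 4.83x."""
    if not X_MIN <= third_exam <= X_MAX:
        raise ValueError("third exam score is outside the observed domain")
    return -173.51 + 4.83 * third_exam

print(round(predict_final(73), 2))   # 179.08
```

Calling `predict_final(50)` raises an error, which mirrors the warning above about predicting outside the range of the sample data.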


UNDERSTANDING SLOPE 

The slope of the line, b, describes how changes in the variables are related. 
It is important to interpret the slope of the line in the context of the situation 
represented by the data. You should be able to write a sentence interpreting 
the slope in plain English. 


INTERPRETATION OF THE SLOPE: The slope of the best fit line tells 
us how the dependent variable (y) changes for every one unit increase in the 
independent (x) variable, on average. 

THIRD EXAM vs FINAL EXAM EXAMPLE 


• Slope: The slope of the line is b = 4.83.

• Interpretation: For a one point increase in the score on the third exam,
the final exam score increases by 4.83 points, on average.


Using the TI-83+ and TI-84+ Calculators 
Using the Linear Regression T Test: LinRegTTest


In the STAT list editor, enter the X data in list L1 and the Y data in list L2, 
paired so that the corresponding (x,y) values are next to each other in the 
lists. (If a particular pair of values is repeated, enter it as many times as it 
appears in the data.) 

On the STAT TESTS menu, scroll down with the cursor to select the
LinRegTTest. (Be careful to select LinRegTTest as some calculators may
also have a different item called LinRegTInt.)

On the LinRegTTest input screen enter: Xlist: L1 ; Ylist: L2 ; Freq: 1

On the next line, at the prompt β or ρ, highlight "≠ 0" and press ENTER.
Leave the line for "RegEq:" blank.

Highlight Calculate and press ENTER. 


LinRegTTest Input Screen and Output Screen (TI-83+ and TI-84+ calculators)


Input screen:

LinRegTTest
Xlist: L1
Ylist: L2
Freq: 1
β or ρ: [≠0] <0 >0
RegEQ:
Calculate


Output screen:

LinRegTTest
y=a+bx
β≠0 and ρ≠0
t = 2.657560155
p = .0261501512
df = 9
a = -173.513363
b = 4.827394209
s = 16.41237711
r² = .4396931104
r = .663093591


The output screen contains a lot of information. For now we will focus on a 
few items from the output, and will return later to the other items. 


• The second line says y=a+bx. Scroll down to find the values
a = -173.513 and b = 4.8273; the equation of the best fit line is
ŷ = -173.51 + 4.83x

• The two items at the bottom are r² = .43969 and r = .663. For now, just
note where to find these values; we will discuss them in the next two
sections.


Graphing the Scatterplot and Regression Line 


We are assuming your X data is already entered in list L1 and your Y data is 
in list L2 

Press 2nd STATPLOT ENTER to use Plot 1 

On the input screen for PLOT 1, highlight On and press ENTER

For TYPE: highlight the very first icon which is the scatterplot and press 
ENTER 

Indicate Xlist: L1 and Ylist: L2 

For Mark: it does not matter which symbol you highlight. 

Press the ZOOM key and then the number 9 (for menu item "ZoomStat") ; 
the calculator will fit the window to the data 

To graph the best fit line, press the "Y=" key and type the equation 
-173.5+4.83X into equation Y1. (The X key is immediately left of the STAT 
key). Press ZOOM 9 again to graph it. 

Optional: If you want to change the viewing window, press the WINDOW 
key. Enter your desired window using Xmin, Xmax, Ymin, Ymax 


**With contributions from Roberta Bloom 


The Correlation Coefficient 

Linear Regression and Correlation: The Correlation Coefficient and 
Coefficient of Determination is a part of Collaborative Statistics collection 
(col10522) by Barbara Illowsky and Susan Dean with contributions from 
Roberta Bloom. The name has been changed from Correlation Coefficient. 


The Correlation Coefficient r 


Besides looking at the scatter plot and seeing that a line seems reasonable, 
how can you tell if the line is a good predictor? Use the correlation 
coefficient as another indicator (besides the scatterplot) of the strength of 
the relationship between x and y. 


The correlation coefficient, r, developed by Karl Pearson in the early 
1900s, is a numerical measure of the strength of association between the 
independent variable x and the dependent variable y. 


The correlation coefficient is calculated as 
Equation: 


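$r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$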


where n = the number of data points. 


If you suspect a linear relationship between x and y, then r can measure
how strong the linear relationship is.
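
As an illustration, the formula for r can be evaluated from the five sums it needs. The sketch below uses the 11 (x, y) pairs of the third exam/final exam example from the previous section; the variable names are illustrative.

```python
from math import sqrt

# Third exam scores (x) and final exam scores (y) for the 11 students.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Correlation coefficient from the computational formula.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 4))   # 0.6631
```

The result agrees with the value of r shown on the calculator output screen for the same data.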
What the VALUE of r tells us: 


• The value of r is always between -1 and +1: -1 ≤ r ≤ 1.

• The size of the correlation r indicates the strength of the linear
relationship between x and y. Values of r close to -1 or to +1 indicate a
stronger linear relationship between x and y.

• If r = 0 there is absolutely no linear relationship between x and y (no
linear correlation).

• If r = 1, there is perfect positive correlation. If r = -1, there is
perfect negative correlation. In both these cases, all of the original data
points lie on a straight line. Of course, in the real world, this will not
generally happen.


What the SIGN of r tells us 


• A positive value of r means that when x increases, y tends to increase
and when x decreases, y tends to decrease (positive correlation).

• A negative value of r means that when x increases, y tends to decrease
and when x decreases, y tends to increase (negative correlation).

• The sign of r is the same as the sign of the slope, b, of the best fit line.


Note: Strong correlation does not suggest that x causes y or y causes x. We
say "correlation does not imply causation." For example, every person
who learned math in the 17th century is dead. However, learning math does
not necessarily cause death!


[Figure: Positive Correlation. A scatter plot showing data with a positive correlation. 0 < r < 1]


[Figure: Negative Correlation. A scatter plot showing data with a negative correlation. -1 < r < 0]


[Figure: Zero Correlation. A scatter plot showing data with zero correlation. r = 0]


The formula for r looks formidable. However, computer spreadsheets,
statistical software, and many calculators can quickly calculate r. The
correlation coefficient r is the bottom item in the output screens for the
LinRegTTest on the TI-83, TI-83+, or TI-84+ calculator (see previous
section for instructions).


The Coefficient of Determination 


r² is called the coefficient of determination. r² is the square of the
correlation coefficient, but is usually stated as a percent, rather than in
decimal form. r² has an interpretation in the context of the data:


• r², when expressed as a percent, represents the percent of variation in
the dependent variable y that can be explained by variation in the
independent variable x using the regression (best fit) line.

• 1 - r², when expressed as a percent, represents the percent of variation
in y that is NOT explained by variation in x using the regression line.
This can be seen as the scattering of the observed data points about the
regression line.


Consider the third exam/final exam example introduced in the previous
section:


• The line of best fit is: ŷ = -173.51 + 4.83x

• The correlation coefficient is r = 0.6631

• The coefficient of determination is r² = 0.6631² = 0.4397

• Interpretation of r² in the context of this example:

• Approximately 44% of the variation (0.4397 is approximately 0.44) in
the final exam grades can be explained by the variation in the grades
on the third exam, using the best fit regression line.

• Therefore approximately 56% of the variation (1 - 0.44 = 0.56) in the
final exam grades can NOT be explained by the variation in the grades
on the third exam, using the best fit regression line. (This is seen as the
scattering of the points about the line.)
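
The "explained variation" reading of r² can be checked numerically. The sketch below relies on the standard identity r² = 1 - SSE/SST for a least-squares line, where SST is the total squared variation of y about its mean; that identity is not stated in this section and is brought in here as an assumption. The data are the 11 (x, y) pairs of the third exam/final exam example.

```python
# Third exam scores (x) and final exam scores (y) for the 11 students.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

# Unexplained variation (scatter about the line) and total variation in y.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)

r_squared = 1 - sse / sst
print(round(r_squared, 4))       # 0.4397
print(round(1 - r_squared, 2))   # 0.56
```

The fraction SSE/SST is the share of variation left unexplained, which is the 56% quoted above.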


**With contributions from Roberta Bloom. 


Glossary 


Coefficient of Correlation
A measure developed by Karl Pearson (early 1900s) that gives the
strength of association between the independent variable and the
dependent variable. The formula is:
Equation:


$r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$
where n is the number of data points. The coefficient cannot be more
than 1 or less than -1. The closer the coefficient is to ±1, the stronger
the evidence of a significant linear relationship between x and y.


Facts About the Correlation Coefficient for Linear Regression 

Linear Regression and Correlation: Testing the Significance of the 
Correlation Coefficient is a part of Collaborative Statistics collection 
(col10522) by Barbara Illowsky and Susan Dean. The title has been 
changed from Facts About the Correlation Coefficient for Linear 
Regression. Roberta Bloom has made major contributions to this module. 


Testing the Significance of the Correlation Coefficient 


The correlation coefficient, r, tells us about the strength of the linear 
relationship between x and y. However, the reliability of the linear model 
also depends on how many observed data points are in the sample. We need 
to look at both the value of the correlation coefficient r and the sample size 
n, together. 


We perform a hypothesis test of the "significance of the correlation 
coefficient" to decide whether the linear relationship in the sample data is 
strong enough to use to model the relationship in the population. 


The sample data are used to compute r, the correlation coefficient for the
sample. If we had data for the entire population, we could find the
population correlation coefficient. But because we only have sample data,
we cannot calculate the population correlation coefficient. The sample
correlation coefficient, r, is our estimate of the unknown population
correlation coefficient.


• The symbol for the population correlation coefficient is ρ, the Greek
letter "rho".

• ρ = population correlation coefficient (unknown)

• r = sample correlation coefficient (known; calculated from sample
data)


The hypothesis test lets us decide whether the value of the population
correlation coefficient ρ is "close to 0" or "significantly different from 0".
We decide this based on the sample correlation coefficient r and the sample
size n.


If the test concludes that the correlation coefficient is significantly 
different from 0, we say that the correlation coefficient is "significant". 


• Conclusion: "There is sufficient evidence to conclude that there is a
significant linear relationship between x and y because the correlation
coefficient is significantly different from 0."

• What the conclusion means: There is a significant linear relationship
between x and y. We can use the regression line to model the linear
relationship between x and y in the population.


If the test concludes that the correlation coefficient is not significantly
different from 0 (it is close to 0), we say that the correlation coefficient is
"not significant".


• Conclusion: "There is insufficient evidence to conclude that there is a
significant linear relationship between x and y because the correlation
coefficient is not significantly different from 0."

• What the conclusion means: There is not a significant linear
relationship between x and y. Therefore we can NOT use the
regression line to model a linear relationship between x and y in the
population.


Note: 


• If r is significant and the scatter plot shows a linear trend, the line can
be used to predict the value of y for values of x that are within the
domain of observed x values.

• If r is not significant OR if the scatter plot does not show a linear
trend, the line should not be used for prediction.

• If r is significant and if the scatter plot shows a linear trend, the line
may NOT be appropriate or reliable for prediction OUTSIDE the
domain of observed x values in the data.


PERFORMING THE HYPOTHESIS TEST 
SETTING UP THE HYPOTHESES: 


• Null Hypothesis: H₀: ρ = 0
• Alternate Hypothesis: Hₐ: ρ ≠ 0


What the hypotheses mean in words:


• Null Hypothesis H₀: The population correlation coefficient IS NOT
significantly different from 0. There IS NOT a significant linear
relationship (correlation) between x and y in the population.

• Alternate Hypothesis Hₐ: The population correlation coefficient IS
significantly DIFFERENT FROM 0. There IS A SIGNIFICANT
LINEAR RELATIONSHIP (correlation) between x and y in the
population.


DRAWING A CONCLUSION: 


• There are two methods to make the decision. Both methods are
equivalent and give the same result.

• Method 1: Using the p-value

• Method 2: Using a table of critical values

• In this chapter of this textbook, we will always use a significance level
of 5%, α = 0.05

• Note: Using the p-value method, you could choose any appropriate
significance level you want; you are not limited to using α = 0.05. But
the table of critical values provided in this textbook assumes that we
are using a significance level of 5%, α = 0.05. (If we wanted to use a
different significance level than 5% with the critical value method, we
would need different tables of critical values that are not provided in
this textbook.)


METHOD 1: Using a p-value to make a decision 


• The linear regression t-test LinRegTTest on the TI-83+ or TI-84+
calculators calculates the p-value.

• On the LinRegTTest input screen, on the line prompt for β or ρ,
highlight "≠ 0"

• The output screen shows the p-value on the line that reads "p =".
• (Most computer statistical software can calculate the p-value.)


If the p-value is less than the significance level (α = 0.05):


• Decision: REJECT the null hypothesis.

• Conclusion: "There is sufficient evidence to conclude that there is a
significant linear relationship between x and y because the correlation
coefficient is significantly different from 0."


If the p-value is NOT less than the significance level (α = 0.05):


• Decision: DO NOT REJECT the null hypothesis.

• Conclusion: "There is insufficient evidence to conclude that there is a
significant linear relationship between x and y because the correlation
coefficient is NOT significantly different from 0."


Calculation Notes: 


• You will use technology to calculate the p-value. The following 
describes the calculations used to compute the test statistic and the p-value: 

• The p-value is calculated using a t-distribution with n - 2 degrees of 
freedom. 

• The formula for the test statistic is t = r√(n - 2) / √(1 - r²). The value 
of the test statistic, t, is shown in the computer or calculator output along 
with the p-value. The test statistic t has the same sign as the correlation 
coefficient r. 

• The p-value is the combined area in both tails. 

• An alternative way to calculate the p-value (p) given by LinRegTTest 
is the command 2*tcdf(abs(t),10^99, n-2) in 2nd DISTR. 


THIRD EXAM vs FINAL EXAM EXAMPLE: p value method 


• Consider the third exam/final exam example. 

• The line of best fit is: ŷ = -173.51 + 4.83x with r = 0.6631 and 
there are n = 11 data points. 

• Can the regression line be used for prediction? Given a third exam 
score (x value), can we use the line to predict the final exam score 
(predicted y value)? 


• H₀: ρ = 0 

• Hₐ: ρ ≠ 0 

• α = 0.05 

• The p-value is 0.026 (from LinRegTTest on your calculator or from 
computer software). 

• The p-value, 0.026, is less than the significance level of α = 0.05. 

• Decision: Reject the Null Hypothesis H₀. 

• Conclusion: There is sufficient evidence to conclude that there is a 
significant linear relationship between x and y because the correlation 
coefficient is significantly different from 0. 

• Because r is significant and the scatter plot shows a linear trend, 
the regression line can be used to predict final exam scores. 
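
The calculation notes above can be checked without a calculator. The following is a minimal sketch, not the calculator's own algorithm: it computes the test statistic t = r√(n - 2) / √(1 - r²) for the third exam/final exam example and approximates the two-tailed p-value by numerically integrating the t-distribution density with Simpson's rule. The function names (`t_statistic`, `t_pdf`, `two_tailed_p_value`) and the number of integration steps are illustrative assumptions.

```python
import math

def t_statistic(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p_value(t, df, steps=2000):
    """Combined area in both tails, via Simpson's rule on [0, |t|].

    steps must be even for Simpson's rule.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3          # area between 0 and |t|
    return 2 * (0.5 - area_zero_to_t)       # what is left in the two tails

# Third exam/final exam example: r = 0.6631, n = 11
r, n = 0.6631, 11
t = t_statistic(r, n)
p_value = two_tailed_p_value(t, n - 2)
print("t =", round(t, 2))
print("p-value =", round(p_value, 3))
print("reject H0 at alpha = 0.05:", p_value < 0.05)
```

The result should agree closely with the p-value of about 0.026 reported by LinRegTTest; small differences come from rounding r to four decimal places.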


METHOD 2: Using a table of Critical Values to make a decision 

The 95% Critical Values of the Sample Correlation Coefficient Table at 
the end of this chapter (before the Summary) may be used to give you a 
good idea of whether the computed value of r is significant or not. 
Compare r to the appropriate critical value in the table. If r is not between 
the positive and negative critical values, then the correlation coefficient is 
significant. If r is significant, then you may want to use the line for 
prediction. 


Example: 

Suppose you computed r = 0.801 using n = 10 data points. 

df = n - 2 = 10 - 2 = 8. The critical values associated with df = 8 are 
-0.632 and +0.632. If r < negative critical value or 
r > positive critical value, then r is significant. Since r = 0.801 and 
0.801 > 0.632, r is significant and the line may be used for prediction. If 
you view this example on a number line, it will help you. 


[Number line: -1, -0.632, 0, +0.632, +0.801, +1] 


r is not significant between -0.632 and +0.632. 
r = 0.801 > +0.632. Therefore, r is significant. 


Example: 

Suppose you computed r = —0.624 with 14 data points. 

df = 14 - 2 = 12. The critical values are -0.532 and 0.532. Since -0.624 
< -0.532, r is significant and the line may be used for prediction. 


[Number line: -0.624, -0.532, 0, +0.532] 


r = -0.624 < -0.532. Therefore, r is significant. 


Example: 

Suppose you computed r = 0.776 and n = 6. df = 6 - 2 = 4. The 
critical values are -0.811 and 0.811. Since -0.811 < 0.776 < 0.811, r is 
not significant and the line should not be used for prediction. 


[Number line: -0.811, 0, 0.776, 0.811] 


-0.811 < r = 0.776 < 0.811. Therefore, r is not significant. 


THIRD EXAM vs FINAL EXAM EXAMPLE: critical value method 


Consider the third exam/final exam example. 

The line of best fit is: ŷ = -173.51 + 4.83x with r = 0.6631 and 
there are n = 11 data points. 

Can the regression line be used for prediction? Given a third exam 
score (x value), can we use the line to predict the final exam score 
(predicted y value)? 


H₀: ρ = 0 

Hₐ: ρ ≠ 0 

α = 0.05 

Use the "95% Critical Value" table for r with 
df = n - 2 = 11 - 2 = 9. 

The critical values are -0.602 and +0.602. 

Since 0.6631 > 0.602, r is significant. 

Decision: Reject H₀. 

Conclusion: There is sufficient evidence to conclude that there is a 
significant linear relationship between x and y because the correlation 
coefficient is significantly different from 0. 

Because r is significant and the scatter plot shows a linear trend, 
the regression line can be used to predict final exam scores. 


Example: 

Additional Practice Examples using Critical Values 

Suppose you computed the following correlation coefficients. Using the 
table at the end of the chapter, determine if r is significant and the line of 
best fit associated with each r can be used to predict a y value. If it helps, 
draw a number line. 


1. r = -0.567 and the sample size, n, is 19. The df = n - 2 = 17. The 
critical value is -0.456. -0.567 < -0.456 so r is significant. 

2. r = 0.708 and the sample size, n, is 9. The df = n - 2 = 7. The 
critical value is 0.666. 0.708 > 0.666 so r is significant. 

3. r = 0.134 and the sample size, n, is 14. The df = 14 - 2 = 12. The 
critical value is 0.532. 0.134 is between -0.532 and 0.532 so r is not 
significant. 

4. r = 0 and the sample size, n, is 5. No matter what the dfs are, r = 0 
is between the two critical values so r is not significant. 
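
The critical value method amounts to a table lookup followed by a comparison. The sketch below illustrates this for several of the examples above; the dictionary `CRITICAL_VALUES` holds only the rows of the 95% table needed here, and the function name `is_significant` is a hypothetical helper, not something defined in this textbook.

```python
# 95% critical values of r, keyed by degrees of freedom (df = n - 2).
# Only the rows used by the examples are copied from the table.
CRITICAL_VALUES = {4: 0.811, 7: 0.666, 8: 0.632, 9: 0.602, 12: 0.532, 17: 0.456}

def is_significant(r, n, table=CRITICAL_VALUES):
    """Return True when r lies outside the interval (-critical, +critical)."""
    df = n - 2
    critical = table[df]          # raises KeyError if the row was not copied
    return abs(r) > critical

# (r, n) pairs taken from the worked examples
examples = [(0.801, 10), (-0.624, 14), (0.776, 6), (0.6631, 11), (0.134, 14)]
results = [is_significant(r, n) for r, n in examples]
for (r, n), significant in zip(examples, results):
    print(f"r = {r}, n = {n}: {'significant' if significant else 'not significant'}")
```

Comparing |r| with the positive critical value is equivalent to checking whether r falls outside the two critical values, since the table is symmetric about 0.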


Assumptions in Testing the Significance of the Correlation 
Coefficient 


Testing the significance of the correlation coefficient requires that certain 
assumptions about the data are satisfied. The premise of this test is that the 
data are a sample of observed points taken from a larger population. We 
have not examined the entire population because it is not possible or 
feasible to do so. We are examining the sample to draw a conclusion about 
whether the linear relationship that we see between x and y in the sample 
data provides strong enough evidence so that we can conclude that there is a 
linear relationship between x and y in the population. 


The regression line equation that we calculate from the sample data gives 
the best fit line for our particular sample. We want to use this best fit line 
for the sample as an estimate of the best fit line for the population. 
Examining the scatterplot and testing the significance of the correlation 
coefficient helps us determine if it is appropriate to do this. 

The assumptions underlying the test of significance are: 


• There is a linear relationship in the population that models the average 
value of y for varying values of x. In other words, the expected value 
of y for each particular value lies on a straight line in the population. 
(We do not know the equation for the line for the population. Our 
regression line from the sample is our best estimate of this line in the 
population.) 

• The y values for any particular x value are normally distributed about 
the line. This implies that there are more y values scattered closer to 
the line than are scattered farther away. Assumption (1) above implies 
that these normal distributions are centered on the line: the means of 
these normal distributions of y values lie on the line. 

• The standard deviations of the population y values about the line are 
equal for each value of x. In other words, each of these normal 
distributions of y values has the same shape and spread about the line. 

• The residual errors are mutually independent (no pattern). 


The y values for each x value are normally distributed about the line 

with the same standard deviation. For each x value, the mean of the y 

values lies on the regression line. More y values lie near the line than 
are scattered further away from the line. 


**With contributions from Roberta Bloom 


Prediction 

Linear Regression and Correlation: Prediction is a part of Collaborative 
Statistics collection (col10522) by Barbara Illowsky and Susan Dean with 
contributions from Roberta Bloom. 


Recall the third exam/final exam example. 


We examined the scatterplot and showed that the correlation coefficient is 
significant. We found the equation of the best fit line for the final exam 
grade as a function of the grade on the third exam. We can now use the least 
squares regression line for prediction. 


Suppose you want to estimate, or predict, the final exam score of statistics 
students who received 73 on the third exam. The exam scores (x-values) 
range from 65 to 75. Since 73 is between the x-values 65 and 75, 
substitute x = 73 into the equation. Then: 

Equation: 


ŷ = -173.51 + 4.83(73) = 179.08 


We predict that statistics students who earn a grade of 73 on the third exam 
will earn a grade of 179.08 on the final exam, on average. 


Example: 
Recall the third exam/final exam example. 
Exercise: 


Problem: 


What would you predict the final exam score to be for a student who 
scored a 66 on the third exam? 


Solution: 


145.27 


Exercise: 


Problem: 


What would you predict the final exam score to be for a student who 
scored a 90 on the third exam? 


Solution: 


The x values in the data are between 65 and 75. 90 is outside of the 
domain of the observed x values in the data (independent variable), so 
you cannot reliably predict the final exam score for this student. (Even 
though it is possible to enter x into the equation and calculate a y 
value, you should not do so!) 


To really understand how unreliable the prediction can be outside of 
the observed x values in the data, make the substitution x = 90 into the 
equation. 


ŷ = -173.51 + 4.83(90) = 261.19 


The final exam score is predicted to be 261.19. The largest the final 
exam score can be is 200. 


Note: The process of predicting inside of the observed x values in the 
data is called interpolation. The process of predicting outside of the 
observed x values in the data is called extrapolation. 
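
The distinction between interpolation and extrapolation can be built into a small prediction helper. This is a sketch under the assumption that the fitted line ŷ = -173.51 + 4.83x and the observed range 65 to 75 from the third exam/final exam example are used; the function name `predict_final` and its parameters are illustrative, not part of the text.

```python
def predict_final(x, x_min=65, x_max=75):
    """Predict the final exam score from a third exam score x.

    Refuses to extrapolate: x must lie inside the observed range
    [x_min, x_max] of third exam scores.
    """
    if not x_min <= x <= x_max:
        raise ValueError(f"x = {x} is outside the observed range {x_min}-{x_max}")
    return -173.51 + 4.83 * x

prediction_73 = predict_final(73)   # interpolation
prediction_66 = predict_final(66)   # interpolation
print(round(prediction_73, 2), round(prediction_66, 2))

try:
    predict_final(90)               # extrapolation, rejected
    extrapolation_blocked = False
except ValueError as err:
    extrapolation_blocked = True
    print("Not predicted:", err)
```

Raising an error is one possible design choice; a gentler variant could return the value together with a warning, but the text's advice is that such a prediction should not be relied on at all.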


**With contributions from Roberta Bloom 


Outliers 

Linear Regression and Correlation: Outliers is a part of Collaborative 
Statistics collection (col10522) by Barbara Illowsky and Susan Dean. The 
module has been modified to include a graphical method for identifying 
outliers contributed by Roberta Bloom. 


In some data sets, there are values (observed data points) called outliers. 
Outliers are observed data points that are far from the least squares 
line. They have large "errors", where the "error" or residual is the vertical 
distance from the line to the point. 


Outliers need to be examined closely. Sometimes, for some reason or 
another, they should not be included in the analysis of the data. It is possible 
that an outlier is a result of erroneous data. Other times, an outlier may hold 
valuable information about the population under study and should remain 
included in the data. The key is to carefully examine what causes a data 
point to be an outlier. 


Besides outliers, a sample may contain one or a few points that are called 
influential points. Influential points are observed data points that are far 
from the other observed data points in the horizontal direction. These points 
may have a big effect on the slope of the regression line. To begin to 
identify an influential point, you can remove it from the data set and see if 
the slope of the regression line is changed significantly. 


Computers and many calculators can be used to identify outliers from the 
data. Computer output for regression analysis will often identify both 
outliers and influential points so that you can examine them. 


Identifying Outliers 

We could guess at outliers by looking at a graph of the scatterplot and best 
fit line. However we would like some guideline as to how far away a point 
needs to be in order to be considered an outlier. As a rough rule of thumb, 
we can flag any point that is located further than two standard 
deviations above or below the best fit line as an outlier. The standard 
deviation used is the standard deviation of the residuals or errors. 


We can do this visually in the scatterplot by drawing an extra pair of lines 
that are two standard deviations above and below the best fit line. Any data 
points that are outside this extra pair of lines are flagged as potential 
outliers. Or we can do this numerically by calculating each residual and 
comparing it to twice the standard deviation. On the TI-83, 83+, or 84+, the 
graphical approach is easier. The graphical procedure is shown first, 
followed by the numerical calculations. You would generally only need to 
use one of these methods. 


Example: 
Exercise: 


Problem: 


In the third exam/final exam example, you can determine if there is an 
outlier or not. If there is an outlier, as an exercise, delete it and fit the 
remaining data to a new line. For this example, the new line ought to 
fit the remaining data better. This means the SSE should be smaller 
and the correlation coefficient ought to be closer to 1 or -1. 


Solution: 


Graphical Identification of Outliers 

With the TI-83,83+,84+ graphing calculators, it is easy to identify the 
outlier graphically and visually. If we were to measure the vertical 
distance from any data point to the corresponding point on the line of 
best fit and that distance was equal to 2s or farther, then we would 
consider the data point to be "too far" from the line of best fit. We 
need to find and graph the lines that are two standard deviations 
below and above the regression line. Any points that are outside these 
two lines are outliers. We will call these lines Y2 and Y3: 


As we did with the equation of the regression line and the correlation 
coefficient, we will use technology to calculate this standard deviation 
for us. Using the LinRegTTest with this data, scroll down through the 
output screens to find s = 16.412. 


Line Y2 = -173.5 + 4.83x - 2(16.4) and line Y3 = -173.5 + 4.83x + 2(16.4) 


where ŷ = -173.5 + 4.83x is the line of best fit. Y2 and Y3 have the same 
slope as the line of best fit. 


Graph the scatterplot with the best fit line in equation Y1, then enter 
the two extra lines as Y2 and Y3 in the "Y="equation editor and press 
ZOOM 9. You will find that the only data point that is not between 
lines Y2 and Y3 is the point x=65, y=175. On the calculator screen it 
is just barely outside these lines. The outlier is the student who had a 
grade of 65 on the third exam and 175 on the final exam; this point is 
further than 2 standard deviations away from the best fit line. 


Sometimes a point is so close to the lines used to flag outliers on the 
graph that it is difficult to tell if the point is between or outside the 
lines. On a computer, enlarging the graph may help; on a small 
calculator screen, zooming in may make the graph clearer. Note that 
when the graph does not give a clear enough picture, you can use the 
numerical comparisons to identify outliers. 

[missing_resource: linrgoutlier.gif] 


Numerical Identification of Outliers 

In the table below, the first two columns are the third exam and final 
exam data. The third column shows the predicted ŷ values calculated 
from the line of best fit: ŷ = -173.5 + 4.83x. The residuals, or errors, 
have been calculated in the fourth column of the table: 

observed y value - predicted y value = y - ŷ. 


s is the standard deviation of all the y - ŷ = ε values, where n = the 
total number of data points. If each residual is calculated and squared, 
and the results are added, we get the SSE. The standard deviation of 
the residuals is calculated from the SSE as: 


s = √(SSE / (n - 2)) 


Rather than calculate the value of s ourselves, we can find s using the 
computer or calculator. For this example, the calculator function 
LinRegTTest found s = 16.4 as the standard deviation of the residuals 
35; -17; 16; -6; -19; 9; 3; -1; -10; -9; -1. 


x     y      ŷ      y - ŷ 

65    175    140    175 - 140 = 35 
67    133    150    133 - 150 = -17 
71    185    169    185 - 169 = 16 
71    163    169    163 - 169 = -6 
66    126    145    126 - 145 = -19 
75    198    189    198 - 189 = 9 
67    153    150    153 - 150 = 3 
70    163    164    163 - 164 = -1 
71    159    169    159 - 169 = -10 
69    151    160    151 - 160 = -9 
69    159    160    159 - 160 = -1 


We are looking for all data points for which the residual is greater 
than 2s=2(16.4)=32.8 or less than -32.8. Compare these values to the 
residuals in column 4 of the table. The only such data point is the 
student who had a grade of 65 on the third exam and 175 on the final 
exam; the residual for this student is 35. 
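
The numerical procedure can be sketched in a few lines of code. This example refits the least squares line from the eleven data points in the table, rather than using the rounded coefficients, so the residuals differ slightly from the rounded values shown in column 4. The helper name `least_squares` and the variable names are assumptions made for illustration.

```python
import math

third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = least_squares(third_exam, final_exam)

# Residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(third_exam, final_exam)]

# SSE and the standard deviation of the residuals, s = sqrt(SSE / (n - 2))
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (len(third_exam) - 2))

# Flag points whose residual is more than 2s above or below the line
outliers = [(x, y) for x, y, e in zip(third_exam, final_exam, residuals)
            if abs(e) > 2 * s]

print("a =", round(a, 2), "b =", round(b, 2), "s =", round(s, 3))
print("potential outliers:", outliers)
```

With the unrounded line, the residual for (65, 175) is a little under 35, but it still exceeds 2s, so the same point is flagged.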


How does the outlier affect the best fit line? 

Numerically and graphically, we have identified the point (65,175) as 
an outlier. We should re-examine the data for this point to see if there 
are any problems with the data. If there is an error we should fix the 
error if possible, or delete the data. If the data is correct, we would 
leave it in the data set. For this problem, we will suppose that we 
examined the data and found that this outlier data was an error. 
Therefore we will continue on and delete the outlier, so that we 
can explore how it affects the results, as a learning experience. 


Compute a new best-fit line and correlation coefficient using the 
10 remaining points: 

On the TI-83, TI-83+, TI-84+ calculators, delete the outlier from L1 
and L2. Using the LinRegTTest, the new line of best fit and the 
correlation coefficient are: 


ŷ = -355.19 + 7.39x and r = 0.9121 


The new line with r = 0.9121 is a stronger correlation than the 
original (r=0.6631) because r = 0.9121 is closer to 1. This means 
that the new line is a better fit to the 10 remaining data values. The 
line can better predict the final exam score given the third exam score. 
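
To see the effect of the outlier without a calculator, the line and correlation coefficient can be computed twice, once with all 11 points and once with (65, 175) removed. This is a minimal sketch; `fit_line` is a hypothetical helper written for this illustration.

```python
import math

def fit_line(points):
    """Return (intercept, slope, r) for the least squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

data = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
        (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]

# Delete the flagged point and refit
remaining = [point for point in data if point != (65, 175)]

a_all, b_all, r_all = fit_line(data)
a_new, b_new, r_new = fit_line(remaining)

print(f"all 11 points: y-hat = {a_all:.2f} + {b_all:.2f}x, r = {r_all:.4f}")
print(f"10 points:     y-hat = {a_new:.2f} + {b_new:.2f}x, r = {r_new:.4f}")
```

Deleting a point is done here only as a learning exercise, as in the text; in practice an outlier is removed only after checking that the data value is in error.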


Numerical Identification of Outliers: Calculating s and Finding 
Outliers Manually 


If you do not have the function LinRegTTest, then you can calculate the 
outlier in the first example by doing the following. 


First, square each |y - ŷ| (see the table above): 
The squares are 35², 17², 16², 6², 19², 9², 3², 1², 10², 9², 1². 


Then, add (sum) all the |y - ŷ| squared terms using the formula 


Σ (i = 1 to 11) (|yᵢ - ŷᵢ|)² = Σ (i = 1 to 11) εᵢ²   (Recall that yᵢ - ŷᵢ = εᵢ.) 


= 35² + 17² + 16² + 6² + 19² + 9² + 3² + 1² + 10² + 9² + 1² 


= 2440 = SSE. The result, SSE, is the Sum of Squared Errors. 


Next, calculate s, the standard deviation of all the y - ŷ = ε values, 
where n = the total number of data points. 


The calculation is s = √(SSE / (n - 2)). 


For the third exam/final exam problem, s = √(2440 / (11 - 2)) = 16.47 


Next, multiply s by 1.9: 

(1.9) · (16.47) = 31.29 

31.29 is almost 2 standard deviations away from the mean of the y - ŷ 
values. 


If we were to measure the vertical distance from any data point to the 
corresponding point on the line of best fit and that distance is at least 1.9s, 
then we would consider the data point to be "too far" from the line of best 
fit. We call that point a potential outlier. 


For the example, if any of the |y - ŷ| values are at least 31.29, the 
corresponding (x, y) data point is a potential outlier. 


For the third exam/final exam problem, all the |y - ŷ| values are less than 
31.29 except for the first one, which is 35. 

35 > 31.29. That is, |y - ŷ| ≥ (1.9) · (s). 


The point which corresponds to |y - ŷ| = 35 is (65, 175). Therefore, the 
data point (65, 175) is a potential outlier. For this example, we will delete 
it. (Remember, we do not always delete an outlier.) 


The next step is to compute a new best-fit line using the 10 remaining 
points. The new line of best fit and the correlation coefficient are: 


ŷ = -355.19 + 7.39x and r = 0.9121 
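
The manual calculation can be reproduced step by step. The sketch below starts from the rounded residuals listed in the table and applies the 1.9s cutoff used in this section; the variable names are illustrative.

```python
import math

# Rounded residuals y - y-hat from the table
rounded_residuals = [35, -17, 16, -6, -19, 9, 3, -1, -10, -9, -1]
n = len(rounded_residuals)

# Sum of Squared Errors
sse = sum(e ** 2 for e in rounded_residuals)

# Standard deviation of the residuals, s = sqrt(SSE / (n - 2))
s = math.sqrt(sse / (n - 2))

# Potential outliers: residuals at least 1.9 standard deviations from the line
cutoff = 1.9 * s
flagged = [e for e in rounded_residuals if abs(e) >= cutoff]

print("SSE =", sse)
print("s =", round(s, 2))
print("cutoff =", round(cutoff, 2))
print("flagged residuals:", flagged)
```

Because s is not rounded before multiplying by 1.9, the cutoff printed here can differ from 31.29 in the second decimal place; the flagged residual is the same.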


Example: 
Exercise: 


Problem: 


Using this new line of best fit (based on the remaining 10 data points), 
what would a student who receives a 73 on the third exam expect to 
receive on the final exam? Is this the same as the prediction made 
using the original line? 


Solution: 


Using the new line of best fit, ŷ = -355.19 + 7.39(73) = 184.28. A 
student who scored 73 points on the third exam would expect to earn 
184 points on the final exam. 


The original line predicted ŷ = -173.51 + 4.83(73) = 179.08, so 
the prediction using the new line with the outlier eliminated differs 
from the original prediction. 


Example: 

(From The Consumer Price Indexes Web site) The Consumer Price Index 
(CPI) measures the average change over time in the prices paid by urban 
consumers for consumer goods and services. The CPI affects nearly all 
Americans because of the many ways it is used. One of its biggest uses is 
as a measure of inflation. By providing information about price changes in 
the Nation's economy to government, business, and labor, the CPI helps 
them to make economic decisions. The President, Congress, and the 


Federal Reserve Board use the CPI's trends to formulate monetary and 
fiscal policies. In the following table, x is the year and y is the CPI. 


x y 
1915 10.1 
1926 [illegible] 
1935 [illegible] 
1940 14.7 
1947 24.1 
1952 26.5 
1964 31.0 
1969 36.7 
1975 49.3 
1979 72.6 
1980 82.4 
1986 109.6 
1991 130.7 


Data: 


Exercise: 


Problem: 


• Make a scatterplot of the data. 

• Calculate the least squares line. Write the equation in the form 
ŷ = a + bx. 

• Draw the line on the scatterplot. 

• Find the correlation coefficient. Is it significant? 

• What is the average CPI for the year 1990? 


Solution: 


• Scatter plot and line of best fit. 

• ŷ = -3204 + 1.662x is the equation of the line of best fit. 

• r = 0.8694 

• The number of data points is n = 14. Use the 95% Critical 
Values of the Sample Correlation Coefficient table at the end of 
Chapter 12. n - 2 = 12. The corresponding critical value is 
0.532. Since 0.8694 > 0.532, r is significant. 

• ŷ = -3204 + 1.662(1990) = 103.4 CPI 

• Using the calculator LinRegTTest, we find that s = 25.4; 
graphing the lines Y2=-3204+1.662X-2(25.4) and 
Y3=-3204+1.662X+2(25.4) shows that no data values are outside 
those lines, identifying no outliers. (Note that the year 1999 was 
very close to the upper line, but still inside it.) 


[Figure: scatter plot of CPI (vertical axis) versus Year (horizontal axis) with the line of best fit] 


Note: In the example, notice the pattern of the points compared to the line. 
Although the correlation coefficient is significant, the pattern in the 
scatterplot indicates that a curve would be a more appropriate model to 
use than a line. In this example, a statistician should prefer to use other 
methods to fit a curve to this data, rather than model the data with the line 
we found. In addition to doing the calculations, it is always important to 
look at the scatterplot when deciding whether a linear model is 
appropriate. 


If you are interested in seeing more years of data, visit the Bureau of Labor 
Statistics CPI website ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt; 
our data is taken from the column entitled "Annual Avg." (third column 
from the right). For example, you could add more current years of data. Try 
adding the more recent years 2004: CPI = 188.9, 2008: CPI = 215.3, and 
2011: CPI = 224.9. See how it affects the model. (Check: 
ŷ = -4436 + 2.295x. r = 0.9018. Is r significant? Is the fit better with 
the addition of the new points?) 


**With contributions from Roberta Bloom 


Glossary 


Outlier 
An observation that does not fit the rest of the data. 


95% Critical Values of the Sample Correlation Coefficient Table 

This module provides an overview of Linear Regression and Correlation: 
95% Critical Values of the Sample Correlation Coefficient Table as a part of 
Collaborative Statistics collection (col10522) by Barbara Illowsky and 
Susan Dean. 


Degrees of Freedom: n - 2    Critical Values: (+ and -) 
1      0.997 
2      0.950 
3      0.878 
4      0.811 
5      0.754 
6      0.707 
7      0.666 
8      0.632 
9      0.602 
10     0.576 
11     0.555 
12     0.532 
13     0.514 
14     0.497 
15     0.482 
16     0.468 
17     0.456 
18     0.444 
19     0.433 
20     0.423 
21     0.413 
22     0.404 
23     0.396 
24     0.388 
25     0.381 
26     0.374 
27     0.367 
28     0.361 
29     0.355 
30     0.349 
40     0.304 
50     0.273 
60     0.250 
70     0.232 
80     0.217 
90     0.205 
100    0.195 


Linear Regression and Correlation: Summary 

Bivariate Data: Each data point has two values. The form is (x, y). 
Line of Best Fit or Least Squares Line (LSL): ŷ = a + bx 

x = independent variable; y = dependent variable 


Residual: Actual y value - predicted y value = y - ŷ 
Correlation Coefficient r: 


1. Used to determine whether a line of best fit is good for prediction. 

2. Between -1 and 1 inclusive. The closer r is to 1 or -1, the closer the 
original points are to a straight line. 

3. If r is negative, the slope is negative. If r is positive, the slope is 
positive. 

4. If r = 0, then the line is horizontal. 


Sum of Squared Errors (SSE): The smaller the SSE, the better the 
original set of points fits the line of best fit. 


Outlier: A point that does not seem to fit the rest of the data. 


Practice: Linear Regression 

This module provides a practice of Linear Regression and Correlation as a 
part of Collaborative Statistics collection (col10522) by Barbara Illowsky 
and Susan Dean. 


Student Learning Outcomes 


e The student will evaluate bivariate data and determine if a line is an 
appropriate fit to the data. 


Given 


Below are real data for the first two decades of AIDS reporting. (Source: 
Centers for Disease Control and Prevention, National Center for HIV, STD, 
and TB Prevention) 


Year        # AIDS cases diagnosed    # AIDS deaths 
Pre-1981    91                        29 
1981        319                       121 
1982        1,170                     453 
1983        3,076                     1,482 
1984        6,240                     3,466 
1985        11,776                    6,878 
1986        19,032                    11,987 
1987        28,564                    16,162 
1988        35,447                    20,868 
1989        42,674                    27,591 
1990        48,634                    31,335 
1991        59,660                    36,560 
1992        78,530                    41,055 
1993        78,834                    44,730 
1994        71,874                    49,095 
1995        68,505                    49,456 
1996        59,347                    38,510 
1997        47,149                    20,736 
1998        38,393                    19,005 
1999        25,174                    18,454 
2000        25,522                    17,347 
2001        25,643                    17,402 
2002        26,464                    16,371 
Total       802,118                   489,093 


Adults and Adolescents only, United States 


Note: We will use the columns “year” and “# AIDS cases diagnosed” for 
all questions unless otherwise stated. 


Graphing 
Graph “year” vs. “# AIDS cases diagnosed.” Plot the points on the graph 
located below in the section titled "Plot". Do not include pre-1981. 
Label both axes with words. Scale both axes. 


Data 


Exercise: 


Problem: 
Enter your data into your calculator or computer. The pre-1981 data 
should not be included. Why is that so? 

Linear Equation 


Write the linear equation below, rounding to 4 decimal places: 


Note: For any prediction questions, the answers are calculated using the 
least squares (best fit) line equation cited in the solution. 


Exercise: 


Problem: Calculate the following: 


• a. a = 
• b. b = 
• c. corr. = 
• d. n = (# of pairs) 


Solution: 


• a. a = -3,448,225 
• b. b = 1750 
• c. corr. = 0.4526 
• d. n = 22 

Exercise: 


Problem: equation: ŷ = 


Solution: 


ŷ = -3,448,225 + 1750x 


Solve 


Exercise: 


Problem: Solve. 


• a. When x = 1985, ŷ = 
• b. When x = 1990, ŷ = 

Solution: 

• a. 25,525 
• b. 34,275 


Plot 


Plot the 2 above points on the graph below. Then, connect the 2 points to 
form the regression line. 


Obtain the graph on your calculator or computer. 


Discussion Questions 


Look at the graph above. 
Exercise: 


Problem: Does the line seem to fit the data? Why or why not? 


Exercise: 


Problem: Do you think a linear fit is best? Why or why not? 
Exercise: 

Problem: 

Hand draw a smooth curve on the graph above that shows the flow of 

the data. 


Exercise: 


Problem: 


What does the correlation imply about the relationship between time 
(years) and the number of diagnosed AIDS cases reported in the U.S.? 


Exercise: 


Problem: 


Why is “year” the independent variable and “# AIDS cases 
diagnosed.” the dependent variable (instead of the reverse)? 


Exercise: 
Problem: Solve. 
• a. When x = 1970, ŷ = 
• b. Why doesn’t this answer make sense? 

Solution: 


• a. -725 


Homework 

Linear Regression and Correlation: Homework is a part of Collaborative Statistics collection (col10522) by 
Barbara Illowsky and Susan Dean. 

Exercise: 


Problem: For each situation below, state the independent variable and the dependent variable. 


• a. A study is done to determine if elderly drivers are involved in more motor vehicle fatalities than 
all other drivers. The number of fatalities per 100,000 drivers is compared to the age of drivers. 

• b. A study is done to determine if the weekly grocery bill changes based on the number of family 
members. 

• c. Insurance companies base life insurance premiums partially on the age of the applicant. 

• d. Utility bills vary according to power consumption. 

• e. A study is done to determine if a higher education reduces the crime rate in a population. 


Solution: 


• a. Independent: Age; Dependent: Fatalities 
• d. Independent: Power Consumption; Dependent: Utility 


Note: For any prediction questions, the answers are calculated using the least squares (best fit) line equation 
cited in the solution. 


Exercise: 
Problem: 


Recently, the annual number of driver deaths per 100,000 for the selected age groups was as follows 
(Source: http://www.census.gov/compendia/statab/cats/transportation/motor_vehicle_accidents_and_fatalities.html): 


Age      Number of Driver Deaths per 100,000 
16-19    38 
20-24    36 
25-34    24 
35-54    20 
55-74    18 
75+      28 


• a. For each age group, pick the midpoint of the interval for the x value. (For the 75+ group, use 80.) 

• b. Using “ages” as the independent variable and “Number of driver deaths per 100,000” as the 
dependent variable, make a scatter plot of the data. 

• c. Calculate the least squares (best-fit) line. Put the equation in the form of: ŷ = a + bx 

• d. Find the correlation coefficient. Is it significant? 

• e. Pick two ages and find the estimated fatality rates. 

• f. Use the two points in (e) to plot the least squares line on your graph from (b). 

• g. Based on the above data, is there a linear relationship between age of a driver and driver fatality 
rate? 

• h. What is the slope of the least squares (best-fit) line? Interpret the slope. 


Exercise: 


Problem: 


The average number of people in a family that received welfare for various years is given below. 
(Source: House Ways and Means Committee, Health and Human Services Department) 


Year Welfare family size 
1969 4.0 
1973 3.6 
1975 3.2 
1979 3.0 
1983 3.0 
1988 3.0 
1991 2.9 


• a. Using “year” as the independent variable and “welfare family size” as the dependent variable, 
make a scatter plot of the data. 

• b. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx 

• c. Find the correlation coefficient. Is it significant? 

• d. Pick two years between 1969 and 1991 and find the estimated welfare family sizes. 

• e. Use the two points in (d) to plot the least squares line on your graph from (a). 

• f. Based on the above data, is there a linear relationship between the year and the average number of 
people in a welfare family? 

• g. Using the least squares line, estimate the welfare family sizes for 1960 and 1995. Does the least 
squares line give an accurate estimate for those years? Explain why or why not. 

• h. Are there any outliers in the above data? 

• i. What is the estimated average welfare family size for 1986? Does the least squares line give an 
accurate estimate for that year? Explain why or why not. 

• j. What is the slope of the least squares (best-fit) line? Interpret the slope. 


Solution: 


• b. ŷ = 88.7206 - 0.0432x 

• c. -0.8533, Yes 

• g. No 

• h. No. 

• i. 2.93, Yes 

• j. slope = -0.0432. As the year increases by one, the welfare family size tends to decrease by 0.0432 
people. 


Exercise: 
Problem: 
Use the AIDS data from the practice for this section, but this time use the columns “year #” and “# new 
AIDS deaths in U.S.” Answer all of the questions from the practice again, using the new columns. 
Exercise: 
Problem: 


The height (sidewalk to roof) of notable tall buildings in America is compared to the number of stories 
of the building (beginning at street level). (Source: Microsoft Bookshelf) 


Height (in feet)   Stories
1050               57
428                28
362                26
529                40
790                60
401                22
380                38
1454               110
1127               100
700                46


a. Using “stories” as the independent variable and “height” as the dependent variable, make a scatter
plot of the data.

b. Does it appear from inspection that there is a relationship between the variables?

c. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

d. Find the correlation coefficient. Is it significant?

e. Find the estimated heights for 32 stories and for 94 stories.

f. Use the two points in (e) to plot the least squares line on your graph from (b).

g. Based on the above data, is there a linear relationship between the number of stories in tall
buildings and the height of the buildings?

h. Are there any outliers in the above data? If so, which point(s)?

i. What is the estimated height of a building with 6 stories? Does the least squares line give an
accurate estimate of height? Explain why or why not.

j. Based on the least squares line, adding an extra story is predicted to add about how many feet to a
building?

k. What is the slope of the least squares (best-fit) line? Interpret the slope.


Solution: 


b. Yes

c. ŷ = 102.4287 + 11.7585x

d. 0.9436; yes

e. 478.70 feet; 1207.73 feet

g. Yes

h. Yes; (57, 1050)

i. 172.98; No

j. 11.7585 feet

k. slope = 11.7585. As the number of stories increases by one, the height of the building tends to
increase by 11.7585 feet.
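
The answers to parts (e) and (i) come from substituting into the fitted line, and the "No" in part (i) reflects that 6 stories lies outside the observed range of x. A short sketch of that check follows; the helper name `predict_height` is illustrative, and the story count 40 is an assumed reading of a garbled table entry (it does not affect the range check).

```python
# Coefficients from the solution to part (c): y-hat = 102.4287 + 11.7585x
INTERCEPT, SLOPE = 102.4287, 11.7585

# Story counts from the table; 40 is an assumed reading of a garbled entry.
stories = [57, 28, 26, 40, 60, 22, 38, 110, 100, 46]


def predict_height(x):
    """Estimated height in feet for a building with x stories."""
    return INTERCEPT + SLOPE * x


for x in (32, 94, 6):
    inside = min(stories) <= x <= max(stories)
    note = "within observed range" if inside else "extrapolation, not reliable"
    print(f"{x} stories: {predict_height(x):.2f} feet ({note})")
```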


Exercise: 


Problem: 


Below is the life expectancy for an individual born in the United States in certain years. (Source: 
National Center for Health Statistics) 


Year of Birth   Life Expectancy
1930            59.7
1940            62.9
1950
1965
1973
1982
1987
1992
2010


a. Decide which variable should be the independent variable and which should be the dependent
variable.

b. Draw a scatter plot of the ordered pairs.

c. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

d. Find the correlation coefficient. Is it significant?

e. Find the estimated life expectancy for an individual born in 1950 and for one born in 1982.

f. Why aren’t the answers to part (e) the values on the above chart that correspond to those years?

g. Use the two points in (e) to plot the least squares line on your graph from (b).

h. Based on the above data, is there a linear relationship between the year of birth and life
expectancy?

i. Are there any outliers in the above data?

j. Using the least squares line, find the estimated life expectancy for an individual born in 1850.
Does the least squares line give an accurate estimate for that year? Explain why or why not.

k. What is the slope of the least squares (best-fit) line? Interpret the slope.


Exercise:


Problem:


The percent of female wage and salary workers who are paid hourly rates is given below for the years 
1979 - 1992. (Source: Bureau of Labor Statistics, U.S. Dept. of Labor) 


Year   Percent of workers paid hourly rates
1979   61.2
1980   60.7
1981   61.3
1982   61.3
1983   61.8
1984   61.7
1985   61.8
1986   62.0
1987   62.7
1990   62.8
1992   62.9


a. Using “year” as the independent variable and “percent” as the dependent variable, make a scatter
plot of the data.

b. Does it appear from inspection that there is a relationship between the variables? Why or why
not?

c. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

d. Find the correlation coefficient. Is it significant?

e. Find the estimated percents for 1991 and 1988.

f. Use the two points in (e) to plot the least squares line on your graph from (b).

g. Based on the above data, is there a linear relationship between the year and the percent of female
wage and salary earners who are paid hourly rates?

h. Are there any outliers in the above data?

i. What is the estimated percent for the year 2050? Does the least squares line give an accurate
estimate for that year? Explain why or why not.

j. What is the slope of the least squares (best-fit) line? Interpret the slope.


Solution:


b. Yes

c. ŷ = -266.8863 + 0.1656x

d. 0.9448; Yes

e. 62.8233; 62.3265

h. Yes; (1987, 62.7)

i. 72.5937; No

j. slope = 0.1656. As the year increases by one, the percent of workers paid hourly rates tends to
increase by 0.1656.


Exercise: 


Problem: 


The maximum discount value of the Entertainment® card for the “Fine Dining” section, Edition 10, for 
various pages is given below. 


Page number Maximum value ($) 


4 16 
14 19 
25 15 
32 17 
43 19 
57 15 
72 16 
85 15 
90 17 


a. Decide which variable should be the independent variable and which should be the dependent
variable.

b. Draw a scatter plot of the ordered pairs.

c. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

d. Find the correlation coefficient. Is it significant?

e. Find the estimated maximum values for the restaurants on page 10 and on page 70.

f. Use the two points in (e) to plot the least squares line on your graph from (b).

g. Does it appear that the restaurants giving the maximum value are placed in the beginning of the
“Fine Dining” section? How did you arrive at your answer?

h. Suppose that there were 200 pages of restaurants. What do you estimate to be the maximum value
for a restaurant listed on page 200?

i. Is the least squares line valid for page 200? Why or why not?

j. What is the slope of the least squares (best-fit) line? Interpret the slope.


The next two questions refer to the following data: The cost of a leading liquid laundry detergent in 
different sizes is given below. 


Size (ounces)   Cost ($)   Cost per ounce
16              3.99
32              4.99
64              5.99
200             10.99


Exercise: 


Problem: 


a. Using “size” as the independent variable and “cost” as the dependent variable, make a scatter plot.

b. Does it appear from inspection that there is a relationship between the variables? Why or why
not?

c. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

d. Find the correlation coefficient. Is it significant?

e. If the laundry detergent were sold in a 40 ounce size, find the estimated cost.

f. If the laundry detergent were sold in a 90 ounce size, find the estimated cost.

g. Use the two points in (e) and (f) to plot the least squares line on your graph from (a).

h. Does it appear that a line is the best way to fit the data? Why or why not?

i. Are there any outliers in the above data?

j. Is the least squares line valid for predicting what a 300 ounce size of the laundry detergent would
cost? Why or why not?

k. What is the slope of the least squares (best-fit) line? Interpret the slope.


Solution: 


b. Yes

c. ŷ = 3.5984 + 0.0371x

d. 0.9986; Yes

e. $5.08

f. $6.93

i. No

j. Not valid

k. slope = 0.0371. As the number of ounces increases by one, the cost of liquid detergent tends to
increase by $0.0371 or is predicted to increase by $0.0371 (about 4 cents).


Exercise: 


Problem: 


a. Complete the above table for the cost per ounce of the different sizes.

b. Using “Size” as the independent variable and “Cost per ounce” as the dependent variable, make a
scatter plot of the data.

c. Does it appear from inspection that there is a relationship between the variables? Why or why
not?

d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

e. Find the correlation coefficient. Is it significant?

f. If the laundry detergent were sold in a 40 ounce size, find the estimated cost per ounce.

g. If the laundry detergent were sold in a 90 ounce size, find the estimated cost per ounce.

h. Use the two points in (f) and (g) to plot the least squares line on your graph from (b).

i. Does it appear that a line is the best way to fit the data? Why or why not?

j. Are there any outliers in the above data?

k. Is the least squares line valid for predicting what a 300 ounce size of the laundry detergent would
cost per ounce? Why or why not?

l. What is the slope of the least squares (best-fit) line? Interpret the slope.


Exercise: 


Problem: 


According to a flyer by a Prudential Insurance Company representative, the approximate probate
fees and taxes for selected net taxable estates are as follows:


Net Taxable Estate ($)   Approximate Probate Fees and Taxes ($)
600,000                  30,000
750,000                  92,500
1,000,000                203,000
1,500,000                438,000
2,000,000                688,000
2,900,000                1,037,000
3,000,000                1,350,000


a. Decide which variable should be the independent variable and which should be the dependent
variable.

b. Make a scatter plot of the data.

c. Does it appear from inspection that there is a relationship between the variables? Why or why
not?

d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

e. Find the correlation coefficient. Is it significant?

f. Find the estimated total cost for a net taxable estate of $1,000,000. Find the cost for $2,500,000.

g. Use the two points in (f) to plot the least squares line on your graph from (b).

h. Does it appear that a line is the best way to fit the data? Why or why not?

i. Are there any outliers in the above data?

j. Based on the above, what would be the probate fees and taxes for an estate that does not have any
assets?

k. What is the slope of the least squares (best-fit) line? Interpret the slope.


Solution: 


c. Yes

d. ŷ = -337,424.6478 + 0.5463x

e. 0.9964; Yes

f. $208,875.35; $1,028,325.35

h. Yes

i. No

k. slope = 0.5463. As the net taxable estate increases by one dollar, the approximate probate fees and
taxes tend to increase by 0.5463 dollars (about 55 cents).


Exercise: 


Problem: The following are advertised sale prices of color televisions at Anderson’s. 


Size (inches)   Sale Price ($)
9               147
20              197
27              297
31              447
35              1177
40              2177
60              2497


a. Decide which variable should be the independent variable and which should be the dependent
variable.

b. Make a scatter plot of the data.

c. Does it appear from inspection that there is a relationship between the variables? Why or why
not?

d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

e. Find the correlation coefficient. Is it significant?

f. Find the estimated sale price for a 32 inch television. Find the cost for a 50 inch television.

g. Use the two points in (f) to plot the least squares line on your graph from (b).

h. Does it appear that a line is the best way to fit the data? Why or why not?

i. Are there any outliers in the above data?

j. What is the slope of the least squares (best-fit) line? Interpret the slope.


Exercise: 


Problem: Below are the average heights for American boys. (Source: Physician’s Handbook, 1990) 


Age (years)   Height (cm)
birth         50.8
2             83.8
3             91.4
5             106.6
7             119.3
10            137.1
14            157.5


a. Decide which variable should be the independent variable and which should be the dependent
variable.

b. Make a scatter plot of the data.

c. Does it appear from inspection that there is a relationship between the variables? Why or why
not?

d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

e. Find the correlation coefficient. Is it significant?

f. Find the estimated average height for a one year-old. Find the estimated average height for an
eleven year-old.

g. Use the two points in (f) to plot the least squares line on your graph from (b).

h. Does it appear that a line is the best way to fit the data? Why or why not?

i. Are there any outliers in the above data?

j. Use the least squares line to estimate the average height for a sixty-two year-old man. Do you
think that your answer is reasonable? Why or why not?

k. What is the slope of the least squares (best-fit) line? Interpret the slope.


Solution: 


c. Yes

d. ŷ = 65.0876 + 7.0948x

e. 0.9761; yes

f. 72.2 cm; 143.13 cm

h. Yes

i. No

j. 505.0 cm; No

k. slope = 7.0948. As the age of an American boy increases by one year, the average height tends to
increase by 7.0948 cm.


Exercise: 


Problem: 


The following chart gives the gold medal times for every other Summer Olympics for the women’s 100 
meter freestyle (swimming). 


Year Time (seconds) 


1912 82.2 
1924 72.4 
1932 66.8 
1952 66.8 
1960 61.2 
1968 60.0 
1976 55.65 
1984 55.92 
1992 54.64 
2000 53.8 
2008 53.1 


a. Decide which variable should be the independent variable and which should be the dependent
variable.

b. Make a scatter plot of the data.

c. Does it appear from inspection that there is a relationship between the variables? Why or why
not?

d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

e. Find the correlation coefficient. Is the decrease in times significant?

f. Find the estimated gold medal time for 1932. Find the estimated time for 1984.

g. Why are the answers from (f) different from the chart values?

h. Use the two points in (f) to plot the least squares line on your graph from (b).

i. Does it appear that a line is the best way to fit the data? Why or why not?

j. Use the least squares line to estimate the gold medal time for the next Summer Olympics. Do you
think that your answer is reasonable? Why or why not?


The next three questions use the following state information. 


State            # letters in name   Year entered the Union   Rank for entering the Union   Area (square miles)
Alabama          7                   1819                     22                            52,423
Colorado                             1876                     38                            104,100
Hawaii                               1959                     50                            10,932
Iowa                                 1846                     29                            56,276
Maryland                             1788                     7                             12,407
Missouri                             1821                     24                            69,709
New Jersey                           1787                     3                             8,722
Ohio                                 1803                     17                            44,828
South Carolina   13                  1788                     8                             32,008
Utah                                 1896                     45                            84,904
Wisconsin                            1848                     30                            65,499

Exercise: 
Problem: 


We are interested in whether or not the number of letters in a state name depends upon the year the state 
entered the Union. 


a. Decide which variable should be the independent variable and which should be the dependent
variable.

b. Make a scatter plot of the data.

c. Does it appear from inspection that there is a relationship between the variables? Why or why
not?

d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

e. Find the correlation coefficient. What does it imply about the significance of the relationship?

f. Find the estimated number of letters (to the nearest integer) a state would have if it entered the
Union in 1900. Find the estimated number of letters a state would have if it entered the Union in
1940.

g. Use the two points in (f) to plot the least squares line on your graph from (b).

h. Does it appear that a line is the best way to fit the data? Why or why not?

i. Use the least squares line to estimate the number of letters a new state that enters the Union this
year would have. Can the least squares line be used to predict it? Why or why not?


Solution: 


c. No

d. ŷ = 47.03 - 0.0216x

e. -0.4280

f. 6; 5


Exercise: 


Problem: 


We are interested in whether there is a relationship between the ranking of a state and the area of the 
state. 


a. Let rank be the independent variable and area be the dependent variable.

b. What do you think the scatter plot will look like? Make a scatter plot of the data.

c. Does it appear from inspection that there is a relationship between the variables? Why or why
not?

d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

e. Find the correlation coefficient. What does it imply about the significance of the relationship?

f. Find the estimated areas for Alabama and for Colorado. Are they close to the actual areas?

g. Use the two points in (f) to plot the least squares line on your graph from (b).

h. Does it appear that a line is the best way to fit the data? Why or why not?

i. Are there any outliers?

j. Use the least squares line to estimate the area of a new state that enters the Union. Can the least
squares line be used to predict it? Why or why not?

k. Delete “Hawaii” and substitute “Alaska” for it. Alaska is the fortieth state with an area of 656,424
square miles.

l. Calculate the new least squares line.

m. Find the estimated area for Alabama. Is it closer to the actual area with this new least squares line
or with the previous one that included Hawaii? Why do you think that’s the case?

n. Do you think that, in general, newer states are larger than the original states?


Exercise: 


Problem: 


We are interested in whether there is a relationship between the rank of a state and the year it entered the 
Union. 


a. Let year be the independent variable and rank be the dependent variable.

b. What do you think the scatter plot will look like? Make a scatter plot of the data.

c. Why must the relationship be positive between the variables?

d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

e. Find the correlation coefficient. What does it imply about the significance of the relationship?

f. Let’s say a fifty-first state entered the union. Based upon the least squares line, when should that
have occurred?

g. Using the least squares line, how many states do we currently have?

h. Why isn’t the least squares line a good estimator for this year?


Solution: 


d. ŷ = -480.5845 + 0.2748x

e. 0.9553

f. 1934


Exercise: 


Problem: 


Below are the percents of the U.S. labor force (excluding self-employed and unemployed) that are
members of a union. We are interested in whether the decrease is significant. (Source: Bureau of Labor 
Statistics, U.S. Dept. of Labor) 


Year Percent 
1945 35.5 
1950 31.5 
1960 31.4 
1970 27.3 
1980 21.9 
1993 15.8 
2011 11.8 


a. Let year be the independent variable and percent be the dependent variable.

b. What do you think the scatter plot will look like? Make a scatter plot of the data.

c. Why will the relationship between the variables be negative?

d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

e. Find the correlation coefficient. What does it imply about the significance of the relationship?

f. Based on your answer to (e), do you think that the relationship can be said to be decreasing?

g. If the trend continues, when will there no longer be any union members? Do you think that will
happen?


The next two questions refer to the following information: The data below reflects the 1991-92 Reunion 
Class Giving. (Source: SUNY Albany alumni magazine) 


Class Year   Average Gift   Total Giving
1922         41.67          125
1927         60.75          1,215
1932         83.82          3,772
1937         87.84          5,710
1947         88.27          6,003
1952         76.14          5,254
1957         52.29          4,393
1962         57.80          4,451
1972         42.68          18,093
1976         49.39          22,473
1981         46.87          20,997
1986         37.03          12,590

Exercise: 
Problem: 


We will use the columns “class year” and “total giving” for all questions, unless otherwise stated. 


a. What do you think the scatter plot will look like? Make a scatter plot of the data.

b. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

c. Find the correlation coefficient. What does it imply about the significance of the relationship?

d. For the class of 1930, predict the total class gift.

e. For the class of 1964, predict the total class gift.

f. For the class of 1850, predict the total class gift. Why doesn’t this value make any sense?


Solution: 


b. ŷ = -569,770.2796 + 296.0351x

c. 0.8302

d. $1577.46

e. $11,642.66

f. -$22,105.34


Exercise: 


Problem: 


We will use the columns “class year” and “average gift” for all questions, unless otherwise stated. 


a. What do you think the scatter plot will look like? Make a scatter plot of the data.

b. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx

c. Find the correlation coefficient. What does it imply about the significance of the relationship?

d. For the class of 1930, predict the average class gift.

e. For the class of 1964, predict the average class gift.

f. For the class of 2010, predict the average class gift. Why doesn’t this value make any sense?


Exercise: 


Problem: 


We are interested in exploring the relationship between the weight of a vehicle and its fuel efficiency 
(gasoline mileage). The data in the table show the weights, in pounds, and fuel efficiency, measured in 
miles per gallon, for a sample of 12 vehicles. 


Weight Fuel Efficiency 
2715 24 
2570 28 
2610 29 
2750 38 
3000 25 
3410 22 
3640 20 
3700 26 
3880 21 
3900 18 
4060 18 
4710 15 


a. Graph a scatterplot of the data.

b. Find the correlation coefficient and determine if it is significant.

c. Find the equation of the best fit line.

d. Write the sentence that interprets the meaning of the slope of the line in the context of the data.

e. What percent of the variation in fuel efficiency is explained by the variation in the weight of the
vehicles, using the regression line? (State your answer in a complete sentence in the context of the
data.)

f. Accurately graph the best fit line on your scatterplot.

g. For the vehicle that weighs 3000 pounds, find the residual (y - yhat). Does the value predicted by
the line underestimate or overestimate the observed data value?

h. Identify any outliers, using either the graphical or numerical procedure demonstrated in the
textbook.

i. The outlier is a hybrid car that runs on gasoline and electric technology, but all other vehicles in
the sample have engines that use gasoline only. Explain why it would be appropriate to remove the
outlier from the data in this situation. Remove the outlier from the sample data. Find the new
correlation coefficient, coefficient of determination, and best fit line.

j. Compare the correlation coefficients and coefficients of determination before and after removing
the outlier, and explain in complete sentences what these numbers indicate about how the model
has changed.


Solution: 


b. r = -0.8, significant

c. yhat = 48.4 - 0.00725x

d. For every one pound increase in weight, the fuel efficiency tends to decrease (or is predicted to
decrease) by 0.00725 miles per gallon. (For every one thousand pounds increase in weight, the fuel
efficiency tends to decrease by 7.25 miles per gallon.)

e. 64% of the variation in fuel efficiency is explained by the variation in weight using the regression
line.

g. yhat = 48.4 - 0.00725(3000) = 26.65 mpg. y - yhat = 25 - 26.65 = -1.65. Because yhat = 26.65 is greater than
y = 25, the line overestimates the observed fuel efficiency.

h. (2750, 38) is the outlier. Be sure you know how to justify it using the requested graphical or
numerical methods, not just by guessing.

i. yhat = 42.4 - 0.00578x

j. Without outlier, r = -0.885, rsquare = 0.76; with outlier, r = -0.8, rsquare = 0.64. The new linear model
is a better fit, after the outlier is removed from the data, because the new correlation coefficient is
farther from 0 and the new coefficient of determination is larger.
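
The residual in part (g) and the percentage in part (e) follow from two short calculations. The sketch below uses the rounded line and the rounded r from this solution; the function name `predict_mpg` is illustrative only.

```python
# Rounded best fit line from part (c): yhat = 48.4 - 0.00725x
def predict_mpg(weight_lb):
    """Predicted fuel efficiency (mpg) for a vehicle of the given weight."""
    return 48.4 - 0.00725 * weight_lb


observed = 25                      # mpg of the 3000-pound vehicle in the table
predicted = predict_mpg(3000)
residual = observed - predicted    # y - yhat; negative means the line overestimates

r = -0.8                           # rounded correlation coefficient from part (b)
r_squared = r ** 2                 # coefficient of determination

print(f"predicted = {predicted:.2f} mpg, residual = {residual:.2f} mpg")
print(f"r^2 = {r_squared:.2f}")
```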


Exercise: 
Problem: 
The four data sets below were created by statistician Francis Anscombe. They show why it is important
to examine the scatterplots for your data, in addition to finding the correlation coefficient, in order to
evaluate the appropriateness of fitting a linear model.


Set 1         Set 2         Set 3         Set 4
x    y        x    y        x    y        x    y
10   8.04     10   9.14     10   7.46     8    6.58
8    6.95     8    8.14     8    6.77     8    5.76
13   7.58     13   8.74     13   12.74    8    7.71
9    8.81     9    8.77     9    7.11     8    8.84
11   8.33     11   9.26     11   7.81     8    8.47
14   9.96     14   8.10     14   8.84     8    7.04
6    7.24     6    6.13     6    6.08     8    5.25
4    4.26     4    3.10     4    5.39     19   12.50
12   10.84    12   9.13     12   8.15     8    5.56
7    4.82     7    7.26     7    6.42     8    7.91
5    5.68     5    4.74     5    5.73     8    6.89


a. For each data set, find the least squares regression line and the correlation coefficient. What did you
discover about the lines and values of r?


For each data set, create a scatter plot and graph the least squares regression line. Use the graphs to
answer the following questions:


b. For which data set does it appear that a curve would be a more appropriate model than a line?

c. Which data set has an influential point (point close to or on the line that greatly influences the
best fit line)?

d. Which data set has an outlier (obviously visible on the scatter plot with best fit line graphed)?

e. Which data set appears to be the most appropriate to model using the least squares regression
line?


Solution: 


a. All four data sets have the same correlation coefficient r = 0.816 and the same least squares regression
line yhat = 3 + 0.5x


b. Set 2; c. Set 4; d. Set 3; e. Set 1 
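
The claim in part (a) can be checked numerically. The sketch below assumes the standard published values of Anscombe's quartet, which may differ from a damaged copy of the table; the names `quartet` and `fit` are illustrative.

```python
import math

# Anscombe's quartet as commonly published (assumed to match the exercise table).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "Set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit(xs, ys):
    """Return (intercept, slope, r) for the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy / math.sqrt(sxx * syy)


results = {name: fit(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (a, b, r) in results.items():
    print(f"{name}: yhat = {a:.2f} + {b:.2f}x, r = {r:.3f}")
```

The summary statistics agree to about two decimal places across all four sets, so only the scatter plots reveal the curve in Set 2, the outlier in Set 3, and the influential point in Set 4.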


[Figure: scatter plots for Data Set 1, Data Set 2, Data Set 3, and Data Set 4, with Y on the vertical axis]


Try these multiple choice questions 


Exercise: 


Problem: A correlation coefficient of -0.95 means there is a ____________ between the two variables.


A. Strong positive correlation
B. Weak negative correlation
C. Strong negative correlation
D. No correlation


Solution: 


C 
Exercise: 


Problem: 


According to the data reported by the New York State Department of Health regarding West Nile Virus
(http://www.health.state.ny.us/nysdoh/westnile/update/update.htm) for the years 2000-2008, the
least squares line equation for the number of reported dead birds (x) versus the number of human West
Nile virus cases (y) is ŷ = -10.2638 + 0.0491x. If the number of dead birds reported in a year is 732,
how many human cases of West Nile virus can be expected? r = 0.5490


A. No prediction can be made.
B. 19.6
C. 15
D. 38.1


Solution: 


A 


The next three questions refer to the following data: (showing the number of hurricanes by category to 
directly strike the mainland U.S. each decade) obtained from www.nhc.noaa.gov/gifs/table6.gif A major 
hurricane is one with a strength rating of 3, 4 or 5. 


Decade        Total Number of Hurricanes   Number of Major Hurricanes
1941-1950     24                           10
1951-1960     17                           8
1961-1970     14                           6
1971-1980     12                           4
1981-1990     15                           5
1991-2000     14                           5
2001-2004     9                            3


Exercise:


Problem:


Using only completed decades (1941-2000), calculate the least squares line for the number of major
hurricanes expected based upon the total number of hurricanes.


A. ŷ = -1.67x + 0.5
B. ŷ = 0.5x - 1.67
C. ŷ = 0.94x - 1.67
D. ŷ = -2x + 1
Solution: 
B 
Exercise: 


Problem: The correlation coefficient is 0.942. Is this considered significant? Why or why not? 


A. No, because 0.942 is greater than the critical value of 0.707
B. Yes, because 0.942 is greater than the critical value of 0.707
C. No, because 0.942 is greater than the critical value of 0.811
D. Yes, because 0.942 is greater than the critical value of 0.811


Solution: 


D 
Exercise: 
Problem: 
The data for 2001-2004 show 9 hurricanes have hit the mainland United States. The line of best fit
predicts 2.83 major hurricanes to hit mainland U.S. Can the least squares line be used to make this
prediction?


A. No, because 9 lies outside the independent variable values
B. Yes, because, in fact, there have been 3 major hurricanes this decade
C. No, because 2.83 lies outside the dependent variable values
D. Yes, because how else could we predict what is going to happen this decade.


Solution: 


A 
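
The three hurricane answers can be tied together in one short calculation. This is a sketch only: the major-hurricane counts for 1961-2000 are assumed values chosen to be consistent with the stated line, ŷ = 0.5x - 1.67, and r = 0.942, and the critical value 0.811 is taken from the answer choices rather than computed.

```python
import math

# Completed decades 1941-2000: total hurricanes (x) and major hurricanes (y).
total = [24, 17, 14, 12, 15, 14]
major = [10, 8, 6, 4, 5, 5]   # the last four counts are assumed, see the lead-in

n = len(total)
x_bar, y_bar = sum(total) / n, sum(major) / n
sxx = sum((x - x_bar) ** 2 for x in total)
syy = sum((y - y_bar) ** 2 for y in major)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(total, major))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
r = sxy / math.sqrt(sxx * syy)

CRITICAL_VALUE = 0.811                  # from the answer choices (n = 6, df = 4)
significant = abs(r) > CRITICAL_VALUE

x_new = 9                               # total hurricanes reported for 2001-2004
in_range = min(total) <= x_new <= max(total)
prediction = intercept + slope * x_new

print(f"yhat = {slope:.2f}x + ({intercept:.2f}), r = {r:.3f}, significant: {significant}")
print(f"x = {x_new}: yhat = {prediction:.2f}, within observed x range: {in_range}")
```

Because 9 is below the smallest observed x value (12), the line is being extrapolated, which is why the prediction should not be trusted even though r is significant.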


**Exercises 21 and 22 contributed by Roberta Bloom 


Linear Regression and Correlation: Regression Lab II (edited: Teegarden) 
This module provides a lab of Linear Regression and Correlation as a part 
of Collaborative Statistics collection (col10522) by Barbara Illowsky and 
Susan Dean. Labs changed to incorporate Minitab.


Correlation/Regression Lab 


Name: 


Student Learning Outcomes: 


* The student will calculate and construct the line of best fit between two 
variables. 


* The student will evaluate the relationship between two variables to 
determine if that relationship is significant. 


I. Collect the Data 


Select 15 textbooks and record the number of pages in the textbook and the 
cost. Be sure to select books from a variety of disciplines. Be sure to select 
either all hard cover or all soft cover books as the binding is a factor in the 
cost. The data may be collected from the college bookstore or a website;
however, indicate where the data were obtained.


1. Complete the table. 


Discipline Number of pages Cost of textbook 


2. Which variable should be the dependent variable and which should be the 
independent variable? Why? 


dependent variable: 
independent variable: 


3. Enter your data into Minitab, create a scatter plot, and attach it
to this lab.


Does there appear to be a relationship between the size of the book and the
cost? Explain using complete sentences.


II. Analyze the Data 


Using Minitab, determine the following information: (Include your session 
window with this lab.) 


1. correlation coefficient =

2. p-value =


3. Is there significant correlation? Why or why not? Answer with 1-3 
complete sentences. 


4. What does the correlation imply about the relationship between the 
number of pages and the cost? Answer with 1-3 complete sentences. 


5. Using Minitab, calculate the regression line: 
6. Create a scatter plot with the regression line and include it with this lab. 


Does the line seem to fit the data? Why? Answer with 1-3 complete 
sentences. 


7. Using the data you obtained, supply an answer for the following 
scenarios: 


a. For a textbook with 400 pages, predict the cost: 
b. For a textbook with 600 pages, predict the cost: 


c. How many pages would you expect a book to have if it cost $58? 


Practice Final Exam 1 
This module is a practice final for an associated elementary statistics textbook, Collaborative Statistics. 


Questions 1-2 refer to the following: 


An experiment consists of tossing two 12-sided dice (the numbers 1-12 are printed on the sides of each die).


* Let Event A = both dice show an even number
* Let Event B = both dice show a number more than 8


Exercise: 


Problem: Events A and B are: 


A. Mutually exclusive.
B. Independent.
C. Mutually exclusive and independent.
D. Neither mutually exclusive nor independent.


Solution: 


B: Independent. 


Exercise: 


Problem: Find P(A|B)


Solution:


C: 4/16
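
Both answers for the two dice questions can be confirmed by listing the 144 equally likely outcomes. The following is a brief sketch; the helper name `prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All 144 equally likely outcomes for two 12-sided dice.
outcomes = list(product(range(1, 13), repeat=2))

A = {o for o in outcomes if o[0] % 2 == 0 and o[1] % 2 == 0}   # both even
B = {o for o in outcomes if o[0] > 8 and o[1] > 8}             # both more than 8


def prob(event):
    """Probability of an event as an exact fraction."""
    return Fraction(len(event), len(outcomes))


p_a, p_b, p_ab = prob(A), prob(B), prob(A & B)
p_a_given_b = p_ab / p_b

mutually_exclusive = len(A & B) == 0
independent = p_ab == p_a * p_b

print(f"P(A) = {p_a}, P(B) = {p_b}, P(A and B) = {p_ab}, P(A|B) = {p_a_given_b}")
print(f"mutually exclusive: {mutually_exclusive}, independent: {independent}")
```

Since P(A|B) equals P(A), knowing that B occurred does not change the probability of A, which is the definition of independence.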
Exercise: 

Problem: 


Which of the following are TRUE when we perform a hypothesis test on matched or paired samples? 


A. Sample sizes are almost never small.
B. Two measurements are drawn from the same pair of individuals or objects.
C. Two sample means are compared to each other.
D. Answer choices B and C are both true.


Solution: 


B: Two measurements are drawn from the same pair of individuals or objects. 


Questions 4 - 5 refer to the following: 


118 students were asked what color their bedrooms were painted: light colors, dark colors, or vibrant colors.
The results were tabulated according to gender.


         Light colors   Dark colors   Vibrant colors
Female   20             22            28
Male     10             30            8
Exercise: 

Problem: 


Find the probability that a randomly chosen student is male or has a bedroom painted with light colors. 


Solution: 


B: 68/118


Exercise: 
Problem: 


Find the probability that a randomly chosen student is male given the student’s bedroom is painted with dark 


colors. 


Solution: 
D: 30/52
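
The two contingency table answers use the addition rule and the definition of conditional probability. A minimal sketch with the counts from the table follows; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the bedroom color table.
counts = {
    "Female": {"Light": 20, "Dark": 22, "Vibrant": 28},
    "Male": {"Light": 10, "Dark": 30, "Vibrant": 8},
}

total = sum(sum(row.values()) for row in counts.values())
male = sum(counts["Male"].values())
light = sum(row["Light"] for row in counts.values())
dark = sum(row["Dark"] for row in counts.values())

# Addition rule: P(male OR light) = P(male) + P(light) - P(male AND light)
p_male_or_light = Fraction(male + light - counts["Male"]["Light"], total)

# Conditional probability: restrict the sample space to dark-colored bedrooms.
p_male_given_dark = Fraction(counts["Male"]["Dark"], dark)

print(f"P(male or light) = {p_male_or_light}")      # Fraction reduces 68/118
print(f"P(male | dark)  = {p_male_given_dark}")     # Fraction reduces 30/52
```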


Questions 6-7 refer to the following:


We are interested in the number of times a teenager must be reminded to do his/her chores each week. A survey of 
40 mothers was conducted. The table below shows the results of the survey. 


£ P (2) 
0 w 
5 


1 
4 ry 
5 ry 
Exercise: 


Problem: Find the probability that a teenager is reminded 2 times. 


e A8 
e BS 
ce 
e D2 


Solution: 


» & 
B: 4 


Exercise: 
Problem: Find the expected number of times a teenager is reminded to do his/her chores. 


A. 15
B. 2.78
C. 1.0
D. 3.13


Solution: 


B: 2.78 


Questions 8-9 refer to the following:


On any given day, approximately 37.5% of the cars parked in the De Anza parking structure are parked crookedly. 
(Survey done by Kathy Plum.) We randomly survey 22 cars. We are interested in the number of cars that are 
parked crookedly. 

Exercise: 


Problem: For every 22 cars, how many would you expect to be parked crookedly, on average? 


• A. 8.25
• B. 11
• C. 18
• D. 7.5


Solution: 


A: 8.25 


Exercise: 


Problem: What is the probability that at least 10 of the 22 cars are parked crookedly?


• A. 0.1263
• B. 0.1607
• C. 0.2870
• D. 0.8393


Solution: 


C: 0.2870 
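
The two parking answers can be checked numerically. The following is a minimal sketch, assuming the count of crookedly parked cars follows a binomial distribution with n = 22 and p = 0.375; the helper name `binom_pmf` is illustrative and not from the text.

```python
from math import comb

n, p = 22, 0.375  # cars surveyed, chance that a car is parked crookedly


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Expected number of crookedly parked cars: n * p
expected = n * p

# P(X >= 10): add the probabilities for k = 10, 11, ..., 22
p_at_least_10 = sum(binom_pmf(k, n, p) for k in range(10, n + 1))

print(expected)
print(round(p_at_least_10, 4))
```

Summing the upper tail directly avoids having to subtract a cumulative probability from 1 by hand.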
Exercise: 
Problem: 
Using a sample of 15 Stanford-Binet IQ scores, we wish to conduct a hypothesis test. Our claim is that the 


mean IQ score on the Stanford-Binet IQ test is more than 100. It is known that the standard deviation of all 
Stanford-Binet IQ scores is 15 points. The correct distribution to use for the hypothesis test is: 


• A. Binomial
• B. Student's-t
• C. Normal
• D. Uniform


Solution: 


C: Normal 


Questions 11 - 13 refer to the following:


De Anza College keeps statistics on the pass rate of students who enroll in math classes. In a sample of 1795 
students enrolled in Math 1A (1st quarter calculus), 1428 passed the course. In a sample of 856 students enrolled in 
Math 1B (2nd quarter calculus), 662 passed. In general, are the pass rates of Math 1A and Math 1B statistically the 
same? Let A = the subscript for Math 1A and B = the subscript for Math 1B. 

Exercise: 


Problem: If you were to conduct an appropriate hypothesis test, the alternate hypothesis would be: 


• A. Ha: pA = pB
• B. Ha: pA > pB
• C. Ho: pA = pB
• D. Ha: pA ≠ pB


Solution: 
D: Ha: pA ≠ pB
Exercise: 
Problem: The Type I error is to:


• A. conclude that the pass rate for Math 1A is the same as the pass rate for Math 1B when, in fact, the pass
rates are different.

• B. conclude that the pass rate for Math 1A is different than the pass rate for Math 1B when, in fact, the
pass rates are the same.

• C. conclude that the pass rate for Math 1A is greater than the pass rate for Math 1B when, in fact, the pass
rate for Math 1A is less than the pass rate for Math 1B.

• D. conclude that the pass rate for Math 1A is the same as the pass rate for Math 1B when, in fact, they are
the same.


Solution: 


B: conclude that the pass rate for Math 1A is different than the pass rate for Math 1B when, in fact, the pass 
rates are the same. 


Exercise: 
Problem: The correct decision is to: 
• A. reject Ho
• B. not reject Ho
• C. There is not enough information given to conduct the hypothesis test


Solution:
B: not reject Ho
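
The decision for the pass-rate questions can be reproduced with a pooled two-proportion z-test. This is a rough sketch rather than the book's worked solution: the 5% significance level is an assumption (the problem does not state one), and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Sample data from the problem statement
x_a, n_a = 1428, 1795  # Math 1A: passed, enrolled
x_b, n_b = 662, 856    # Math 1B: passed, enrolled
ALPHA = 0.05           # assumed significance level, not given in the text

p_a, p_b = x_a / n_a, x_b / n_b
p_pool = (x_a + x_b) / (n_a + n_b)  # pooled proportion under Ho: pA = pB

# Standard error of the difference in sample proportions under Ho
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se

# Two-tailed p-value, because Ha is pA != pB
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = p_value < ALPHA

print(round(z, 3), round(p_value, 4), reject_null)
```

Under these assumptions the p-value is well above 0.05, which agrees with the answer "not reject Ho".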


Kia, Alejandra, and Iris are runners on the track teams at three different schools. Their running times, in minutes, 
and the statistics for the track teams at their respective schools, for a one mile run, are given in the table below: 


             Running Time   School Average Running Time   School Standard Deviation
Kia              4.9                  5.2                          .15
Alejandra        4.2                  4.6                          .25
Iris             4.5                  4.9                          .12


Exercise: 


Problem: Which student is the BEST when compared to the other runners at her school? 
• A. Kia
• B. Alejandra
• C. Iris
• D. Impossible to determine


Solution: 


C: Iris 
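
Comparing runners across schools comes down to z-scores. The sketch below assumes the times are 4.9, 4.2, and 4.5 minutes and that the school standard deviations are 0.15, 0.25, and 0.12 minutes (decimal points restored from the garbled table); a lower time is better, so the most negative z-score wins.

```python
# (running time, school average, school standard deviation), all in minutes.
# The decimal placement of the standard deviations is an assumption.
runners = {
    "Kia": (4.9, 5.2, 0.15),
    "Alejandra": (4.2, 4.6, 0.25),
    "Iris": (4.5, 4.9, 0.12),
}

# z = (value - mean) / standard deviation
z_scores = {name: (t - avg) / sd for name, (t, avg, sd) in runners.items()}

# Faster than average means a negative z, so the smallest z is the best runner
best = min(z_scores, key=z_scores.get)

for name, z in z_scores.items():
    print(name, round(z, 2))
print("Best relative to her school:", best)
```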


Questions 15 - 16 refer to the following:


The following adult ski sweater prices are from the Gorsuch Ltd. Winter catalog: 


{$212, $292, $278, $199, $280, $236}


Assume the underlying sweater price population is approximately normal. The null hypothesis is that the mean 
price of adult ski sweaters from Gorsuch Ltd. is at least $275. 
Exercise: 


Problem: The correct distribution to use for the hypothesis test is: 


• A. Normal
• B. Binomial
• C. Student's-t
• D. Exponential


Solution: 


C: Student's-t 


Exercise: 


Problem: The hypothesis test: 


• A. is two-tailed
• B. is left-tailed
• C. is right-tailed
• D. has no tails


Solution: 


B: is left-tailed 
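
For the sweater prices, the test statistic behind the Student's-t, left-tailed setup can be sketched as follows. Only the statistic and degrees of freedom are computed here; looking up the p-value would need a t-table or a statistics package, which is left out of this sketch.

```python
from math import sqrt
from statistics import mean, stdev

prices = [212, 292, 278, 199, 280, 236]  # sweater prices in dollars
mu_0 = 275  # boundary value of the null hypothesis (mean is at least $275)

x_bar = mean(prices)
s = stdev(prices)  # sample standard deviation (n - 1 in the denominator)
n = len(prices)

# The population standard deviation is unknown and n is small, so use Student's t
t_stat = (x_bar - mu_0) / (s / sqrt(n))
df = n - 1

# Ha is "mean < 275", so evidence against Ho lies in the left tail (negative t)
print(x_bar, round(s, 2), round(t_stat, 3), df)
```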
Exercise: 
Problem: 
Sara, a Statistics student, wanted to determine the mean number of books that college professors have in their 


office. She randomly selected 2 buildings on campus and asked each professor in the selected buildings how 
many books are in his/her office. Sara surveyed 25 professors. The type of sampling selected is a: 


• A. simple random sampling
• B. systematic sampling
• C. cluster sampling
• D. stratified sampling


Solution: 


C: cluster sampling 
Exercise: 
Problem: 


A clothing store would use which measure of the center of data when placing orders for the typical "middle" 
customer? 


• A. Mean
• B. Median
• C. Mode
• D. IQR


Solution: 


B: Median 


Exercise: 


Problem: In a hypothesis test, the p-value is 


• A. the probability that an outcome of the data will happen purely by chance when the null hypothesis is
true.
• B. called the preconceived alpha.
• C. compared to beta to decide whether to reject or not reject the null hypothesis.
• D. Answer choices A and B are both true.


Solution: 


A: the probability that an outcome of the data will happen purely by chance when the null hypothesis is true. 


Questions 20 - 22 refer to the following: 


A community college offers classes 6 days a week: Monday through Saturday. Maria conducted a study of the 

students in her classes to determine how many days per week the students who are in her classes come to campus 
for classes. In each of her 5 classes she randomly selected 10 students and asked them how many days they come 
to campus for classes. Each of her classes is the same size. The results of her survey are summarized in the table
below.


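Number of Days on Campus | Frequency | Relative Frequency | Cumulative Relative Frequency
[Table partially illegible in the source. Recovered entries: days 1, 2; frequencies 2, 12, 10; relative frequencies .24, .20, .02; cumulative relative frequency .98.]


Exercise:


Problem: Combined with convenience sampling, what other sampling technique did Maria use?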


• A. simple random
• B. systematic
• C. cluster
• D. stratified


Solution: 


D: stratified 


Exercise: 


Problem: How many students come to campus for classes 4 days a week? 


• A. 49
• B. 25
• C. 30
• D. 13


Solution: 
B: 25 
Exercise: 
Problem: What is the 60th percentile for this data?


• A. 2
• B. 3
• C. 4
• D. 5


Solution:
C: 4
The next two questions refer to the following: 


The following data are the results of a random survey of 110 Reservists called to active duty to increase security at 
California airports. 


Number of Dependents   Frequency
0                      11
1                      27
2                      33
3                      20
4                      19


Exercise: 
Problem: 


Construct a 95% Confidence Interval for the true population mean number of dependents of Reservists called
to active duty to increase security at California airports.


• A. (1.85, 2.32)
• B. (1.80, 2.36)
• C. (1.97, 2.46)
• D. (1.92, 2.50)


Solution: 


A: (1.85, 2.32) 
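
The interval can be rebuilt from the frequency table. This sketch assumes a Student's-t interval with 109 degrees of freedom; the critical value 1.982 is an approximate t-table figure supplied here as an assumption, not a number from the text.

```python
from math import sqrt

# Number of dependents -> frequency, from the table above
freq = {0: 11, 1: 27, 2: 33, 3: 20, 4: 19}

n = sum(freq.values())
x_bar = sum(x * f for x, f in freq.items()) / n

# Sample standard deviation computed from grouped data
s = sqrt(sum(f * (x - x_bar) ** 2 for x, f in freq.items()) / (n - 1))

# Approximate t critical value for 95% confidence, df = 109 (assumed from a t-table)
T_CRIT = 1.982

error_bound = T_CRIT * s / sqrt(n)
ci = (x_bar - error_bound, x_bar + error_bound)

print(n, round(x_bar, 4), round(s, 4))
print(tuple(round(v, 2) for v in ci))
```

Using the normal critical value 1.96 instead gives nearly the same interval, because the sample is large.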


Exercise: 


Problem: The 95% confidence interval above means:


• A. 5% of Confidence Intervals constructed this way will not contain the true population average number of
dependents.
• B. We are 95% confident the true population mean number of dependents falls in the interval.
• C. Both of the above answer choices are correct.
• D. None of the above.


Solution:


C: Both of the above answer choices are correct.


Exercise: 


Problem: X~U (4, 10). Find the 30th percentile. 


• A. 0.3000
• B. 3
• C. 5.8
• D. 6.1


Solution: 


C: 5.8 


Exercise: 


Problem: If X~Exp(0.8), then P(X < μ) =


• A. 0.3679
• B. 0.4727
• C. 0.6321
• D. cannot be determined


Solution: 


C: 0.6321 
Exercise: 
Problem: 


The lifetime of a computer circuit board is normally distributed with a mean of 2500 hours and a standard 
deviation of 60 hours. What is the probability that a randomly chosen board will last at most 2560 hours? 


• A. 0.8413
• B. 0.1587
• C. 0.3461
• D. 0.6539


Solution: 
A: 0.8413 
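
A quick numerical check of the circuit-board answer, as a sketch using Python's standard-library normal distribution:

```python
from statistics import NormalDist

# Lifetime in hours: normal with mean 2500 and standard deviation 60
lifetime = NormalDist(mu=2500, sigma=60)

# "At most 2560 hours" is the cumulative probability up to 2560,
# which sits exactly one standard deviation above the mean
p_at_most_2560 = lifetime.cdf(2560)

print(round(p_at_most_2560, 4))
```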


Exercise: 


Problem: 


A survey of 123 Reservists called to active duty as a result of the September 11, 2001, attacks was conducted 
to determine the proportion that were married. Eighty-six reported being married. Construct a 98% 
confidence interval for the true population proportion of reservists called to active duty that are married. 


• A. (0.6030, 0.7954)
• B. (0.6181, 0.7802)
• C. (0.5927, 0.8057)
• D. (0.6312, 0.7672)


Solution: 


A: (0.6030, 0.7954) 
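
The 98% interval for the proportion of married reservists can be sketched with the usual normal-approximation formula, p' ± z·sqrt(p'q'/n). Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 123        # reservists surveyed
married = 86   # number who reported being married
confidence = 0.98

p_hat = married / n
q_hat = 1 - p_hat

# Two-sided critical value: 1% in each tail for 98% confidence
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

error_bound = z_crit * sqrt(p_hat * q_hat / n)
ci = (p_hat - error_bound, p_hat + error_bound)

print(tuple(round(v, 4) for v in ci))
```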
Exercise: 


Problem: 


Winning times in 26 mile marathons run by world class runners average 145 minutes with a standard 
deviation of 14 minutes. A sample of the last 10 marathon winning times is collected. 


Let X-bar = the mean winning time for 10 marathons.


The distribution for X-bar is:


• A. N(145, 14/√10)
• B. N(145, 14)
• C. t_9
• D. t_10


Solution:


A: N(145, 14/√10)


Exercise: 


Problem: 


Suppose that Phi Beta Kappa honors the top 1% of college and university seniors. Assume that grade point 
means (G.P.A.) at a certain college are normally distributed with a 2.5 mean and a standard deviation of 0.5. 
What would be the minimum G.P.A. needed to become a member of Phi Beta Kappa at that college? 


• A. 3.99
• B. 1.34
• C. 3.00
• D. 3.66


Solution: 


D: 3.66 
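
The cutoff for the top 1% is the 99th percentile of the G.P.A. distribution. A minimal sketch using the inverse normal CDF:

```python
from statistics import NormalDist

# G.P.A. at this college: normal with mean 2.5 and standard deviation 0.5
gpa = NormalDist(mu=2.5, sigma=0.5)

# The top 1% starts where 99% of students fall below
min_gpa = gpa.inv_cdf(0.99)

print(round(min_gpa, 2))
```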


The number of people living on American farms has declined steadily during this century. Here are data on the 
farm population (in millions of persons) from 1935 to 1980. 


Year         1935   1940   1945   1950   1955   1960   1965   1970   1975   1980
Population   32.1   30.5   24.4   23.0   19.1   15.6   12.4    9.7    8.9    7.2


The linear regression equation is y-hat = 1166.93 - 0.5868x
Exercise: 


Problem: What was the expected farm population (in millions of persons) for 1980? 


• A. 7.2
• B. 5.1
• C. 6.0
• D. 8.0


Solution:


B: 5.1
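
The regression line and the 1980 prediction can be recomputed from the table with ordinary least squares. This sketch assumes the years run from 1935 to 1980 in five-year steps.

```python
years = list(range(1935, 1985, 5))  # 1935, 1940, ..., 1980
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.4, 9.7, 8.9, 7.2]  # millions

n = len(years)
x_bar = sum(years) / n
y_bar = sum(population) / n

# Least-squares slope and intercept
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, population))
s_xx = sum((x - x_bar) ** 2 for x in years)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Predicted farm population (in millions) for 1980
predicted_1980 = intercept + slope * 1980

print(round(intercept, 2), round(slope, 4), round(predicted_1980, 1))
```

The fitted line predicts about 5.1 million for 1980, below the observed 7.2 million; the question asks for the value on the line, not the observed value.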


Exercise: 


Problem: In linear regression, which is the best possible SSE? 


• A. 13.46
• B. 18.22
• C. 24.05
• D. 16.33


Solution: 


A: 13.46 
Exercise: 
Problem: 


In regression analysis, if the correlation coefficient is close to 1 what can be said about the best fit line? 


• A. It is a horizontal line. Therefore, we cannot use it.
• B. There is a strong linear pattern. Therefore, it is most likely a good model to be used.
• C. The correlation coefficient is close to the limit. Therefore, it is hard to make a decision.
• D. We do not have the equation. Therefore, we cannot say anything about it.


Solution: 


B: There is a strong linear pattern. Therefore, it is most likely a good model to be used. 


Questions 34 - 36 refer to the following:


A study of the career plans of young women and men sent questionnaires to all 722 members of the senior class in 
the College of Business Administration at the University of Illinois. One question asked which major within the 
business program the student had chosen. Here are the data from the students who responded. 


                 Female   Male
Accounting         68      56
Administration     91      40
Economics           5       6
Finance            61      59


Does the data suggest that there is a relationship between the gender of students and their choice of major? 


Exercise: 


Problem: The distribution for the test is: 


• A. Chi²_8
• B. Chi²_3
• C. t_721
• D. N(0, 1)


Solution:


B: Chi²_3


Exercise: 


Problem: The expected number of females who choose Finance is:


• A. 37
• B. 61
• C. 60
• D. 70


Solution: 


D: 70 


Exercise: 


Problem: The p-value is 0.0127 and the level of significance is 0.05. The conclusion to the test is: 


• A. There is insufficient evidence to conclude that the choice of major and the gender of the student are not
independent of each other.

• B. There is sufficient evidence to conclude that the choice of major and the gender of the student are not
independent of each other.

• C. There is sufficient evidence to conclude that students find Economics very hard.

• D. There is insufficient evidence to conclude that more females prefer Administration than males.


Solution: 
B: There is sufficient evidence to conclude that the choice of major and the gender of the student are not 
independent of each other. 
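
The three answers for the major-by-gender table (degrees of freedom, expected count, and p-value) can be reproduced with a chi-square test of independence. This is a sketch under stated assumptions: the p-value line uses a closed-form expression for the chi-square upper tail that is valid only for 3 degrees of freedom, chosen here to avoid a third-party library.

```python
from math import erfc, exp, pi, sqrt

# Observed counts: major -> (female, male)
observed = {
    "Accounting": (68, 56),
    "Administration": (91, 40),
    "Economics": (5, 6),
    "Finance": (61, 59),
}

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

# Expected count = (row total)(column total) / (grand total)
expected = {
    major: tuple(sum(row) * c / grand_total for c in col_totals)
    for major, row in observed.items()
}

chi2 = sum(
    (o - e) ** 2 / e
    for major in observed
    for o, e in zip(observed[major], expected[major])
)
df = (len(observed) - 1) * (2 - 1)

# Upper-tail probability of a chi-square with df = 3 (closed form, df = 3 only)
p_value = erfc(sqrt(chi2 / 2)) + sqrt(2 * chi2 / pi) * exp(-chi2 / 2)

print(round(expected["Finance"][0]), df, round(chi2, 2), round(p_value, 4))
```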

Exercise: 
Problem: 
An agency reported that the work force nationwide is composed of 10% professional, 10% clerical, 30%
skilled, 15% service, and 35% semiskilled laborers. A random sample of 100 San Jose residents indicated 15
professional, 15 clerical, 40 skilled, 10 service, and 20 semiskilled laborers. At α = .10 does the work force in
San Jose appear to be consistent with the agency report for the nation? Which kind of test is it?


• A. Chi² goodness of fit
• B. Chi² test of independence
• C. Independent groups proportions
• D. Unable to determine


Solution: 


A: Chi² goodness of fit


Practice Final Exam 2 

This module is a practice final for an associated elementary statistics textbook, 
Collaborative Statistics, available for Fall 2008. 

Exercise: 


Problem: 


A study was done to determine the proportion of teenagers that own a car. 
The population proportion of teenagers that own a car is the 


• A. statistic
• B. parameter
• C. population
• D. variable


Solution: 


B: parameter 


The next two questions refer to the following data: 


value   frequency
0       1
1       4
2       7
3       9
6       4


Exercise: 


Problem: The box plot for the data is: 


e A 


Solution: 


A 
Exercise: 


Problem: 


If 6 were added to each value of the data in the table, the 15th percentile of 
the new list of values is: 


• A. 6
• B. 1
• C. 7
• D. 8


Solution:


C: 7


The next two questions refer to the following situation: 


Suppose that the probability of a drought in any independent year is 20%. Out of 
those years in which a drought occurs, the probability of water rationing is 10%. 
However, in any year, the probability of water rationing is 5%. 

Exercise: 


Problem: 
What is the probability of both a drought and water rationing occurring? 


• A. 0.05
• B. 0.01
• C. 0.02
• D. 0.30

Solution:

C: 0.02

Exercise: 

Problem: Which of the following is true? 
• A. drought and water rationing are independent events
• B. drought and water rationing are mutually exclusive events
• C. none of the above


Solution: 


C: none of the above 
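
Both drought answers follow from the multiplication rule and the definitions of independence and mutual exclusivity. A small sketch with illustrative variable names:

```python
p_drought = 0.20                 # P(drought)
p_ration_given_drought = 0.10    # P(rationing | drought)
p_ration = 0.05                  # P(rationing) in any year

# Multiplication rule: P(D and R) = P(D) * P(R | D)
p_both = p_drought * p_ration_given_drought

# Independent events would satisfy P(R | D) = P(R)
independent = abs(p_ration_given_drought - p_ration) < 1e-12

# Mutually exclusive events would satisfy P(D and R) = 0
mutually_exclusive = p_both == 0

print(round(p_both, 2), independent, mutually_exclusive)
```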


The next two questions refer to the following situation: 


Suppose that a survey yielded the following data: 


gender   apple   pumpkin   pecan
female    40       10        30
male      20       30        10


Favorite Pie Type


Exercise: 


Problem: 


Suppose that one individual is randomly chosen. The probability that the 
person’s favorite pie is apple or the person is male is: 


Solution: 
D: 100/140


Exercise: 


Problem: Suppose Ho is: Favorite pie type and gender are independent.
The p-value is:

• A. ≈ 0
• B. 1
• C. 0.05
• D. cannot be determined


Solution:


A: ≈ 0
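
Both pie questions can be checked from the table. The sketch below applies the addition rule for "apple or male" and then runs a chi-square test of independence; the p-value uses the closed form exp(-x/2), which holds only for 2 degrees of freedom, as is the case for this 2-by-3 table.

```python
from math import exp

table = {
    "female": {"apple": 40, "pumpkin": 10, "pecan": 30},
    "male": {"apple": 20, "pumpkin": 30, "pecan": 10},
}
pies = ["apple", "pumpkin", "pecan"]

row_totals = {g: sum(counts.values()) for g, counts in table.items()}
col_totals = {p: sum(table[g][p] for g in table) for p in pies}
total = sum(row_totals.values())

# Addition rule: P(apple or male) = P(apple) + P(male) - P(apple and male)
p_apple_or_male = (
    col_totals["apple"] + row_totals["male"] - table["male"]["apple"]
) / total

# Chi-square statistic for independence
chi2 = 0.0
for g in table:
    for p in pies:
        e = row_totals[g] * col_totals[p] / total
        chi2 += (table[g][p] - e) ** 2 / e

df = (len(table) - 1) * (len(pies) - 1)
p_value = exp(-chi2 / 2)  # chi-square upper tail, valid for df = 2 only

print(round(p_apple_or_male, 4), df, round(chi2, 2), p_value)
```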


The next two questions refer to the following situation: 


Let’s say that the probability that an adult watches the news at least once per 
week is 0.60. We randomly survey 14 people. Of interest is the number that 
watch the news at least once per week. 

Exercise: 


Problem: Which of the following statements is FALSE? 


• A. X~B(14, 0.60)
• B. The values for x are: {1, 2, 3,..., 14}
• C. μ = 8.4
• D. P(X = 5) = 0.0408


Solution:


B: The values for x are: {1, 2, 3,..., 14}


Exercise: 


Problem: Find the probability that at least 6 adults watch the news. 


• A. 6/14
• B. 0.8499
• C. 0.9417
• D. 0.6429


Solution: 


C: 0.9417 
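
The news-watching answers can be verified with the binomial formula. A minimal sketch, assuming X~B(14, 0.60); the helper name `binom_pmf` is illustrative.

```python
from math import comb

n, p = 14, 0.60  # people surveyed, chance an adult watches the news weekly


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


mean = n * p
p_exactly_5 = binom_pmf(5, n, p)

# X can be 0, so the possible values are 0, 1, ..., 14 (this is why choice B is false)
possible_values = list(range(0, n + 1))

# P(X >= 6) = 1 - P(X <= 5)
p_at_least_6 = 1 - sum(binom_pmf(k, n, p) for k in range(0, 6))

print(round(mean, 1), round(p_exactly_5, 4), round(p_at_least_6, 4))
```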
Exercise: 


Problem: 


The following histogram is most likely to be a result of sampling from 
which distribution? 


• A. Chi-Square with df = 6
• B. Exponential
• C. Uniform
• D. Binomial


Solution: 


D: Binomial 


The ages of campus day and evening students are known to be normally
distributed. A sample of 6 campus day and evening students reported their ages
(in years) as: {18, 35, 27, 45, 20, 20}

Exercise: 


Problem: 


What is the error bound for the 90% confidence interval of the true average 
age? 


• A. 11.2
• B. 22.3
• C. 17.5
• D. 8.7


Solution: 


D: 8.7 
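
The error bound comes from the Student's-t formula t·s/√n. In this sketch the critical value 2.015 (90% confidence, 5 degrees of freedom) is an assumed t-table value, not a number given in the text.

```python
from math import sqrt
from statistics import mean, stdev

ages = [18, 35, 27, 45, 20, 20]  # sample of student ages in years

n = len(ages)
x_bar = mean(ages)
s = stdev(ages)  # sample standard deviation

# t critical value for 90% confidence with df = n - 1 = 5 (assumed from a t-table)
T_CRIT = 2.015

error_bound = T_CRIT * s / sqrt(n)

print(x_bar, round(s, 2), round(error_bound, 1))
```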
Exercise: 
Problem: 


If a normally distributed random variable has μ = 0 and σ = 1, then
97.5% of the population values lie above:


• A. -1.96
• B. 1.96
• C. 1
• D. -1


Solution: 


A: -1.96 


The next three questions refer to the following situation: 


The amount of money a customer spends in one trip to the supermarket is 
known to have an exponential distribution. Suppose the average amount of 
money a customer spends in one trip to the supermarket is $72. 

Exercise: 


Problem: 


What is the probability that one customer spends less than $72 in one trip 
to the supermarket? 


• A. 0.6321
• B. 0.5000
• C. 0.3714
• D. 1


Solution: 


A: 0.6321 
Exercise: 


Problem: 


How much money altogether would you expect the next 5 customers to spend
in one trip to the supermarket (in dollars)?


• C. 5184
• D. 360


Solution: 


D: 360 


Exercise: 


Problem: 


If you want to find the probability that the mean of 50 customers is less 
than $60, the distribution to use is: 


• A. N(72, 72)
• B. N(72, 72/√50)
• C. Exp(72)
• D. Exp(1/72)


Solution:
B: N(72, 72/√50)
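
The three supermarket answers can be tied together numerically. This sketch assumes spending is exponential with mean $72 (so its standard deviation is also 72) and that the Central Limit Theorem applies to the mean of 50 customers.

```python
from math import exp, sqrt
from statistics import NormalDist

mean_spend = 72  # dollars per trip; an exponential has standard deviation = mean

# Exponential CDF: P(X < x) = 1 - e^(-x / mean), evaluated at x = mean
p_less_than_mean = 1 - exp(-72 / mean_spend)

# Expected total for 5 customers is 5 times the mean
expected_total_5 = 5 * mean_spend

# Central Limit Theorem: the mean of 50 customers is approximately N(72, 72/sqrt(50))
x_bar_dist = NormalDist(mu=mean_spend, sigma=mean_spend / sqrt(50))
p_mean_below_60 = x_bar_dist.cdf(60)

print(round(p_less_than_mean, 4), expected_total_5, round(p_mean_below_60, 4))
```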


The next three questions refer to the following situation: 


The amount of time it takes a fourth grader to carry out the trash is uniformly 
distributed in the interval from 1 to 10 minutes. 
Exercise: 


Problem: 


What is the probability that a randomly chosen fourth grader takes more 
than 7 minutes to take out the trash? 


Solution: 
3/9


Exercise: 


Problem: 
Which graph best shows the probability that a randomly chosen fourth 


grader takes more than 6 minutes to take out the trash given that he/she has 
already taken more than 3 minutes? 


[Answer-choice graphs of f(x) are not reproduced.]


Solution: 


D 
Exercise: 
Problem: 


We should expect a fourth grader to take how many minutes to take out the 
trash? 


• A. 4.5
• B. 5.5
• C. 5
• D. 10


Solution: 


B: 5.5 
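
The trash questions all use the uniform distribution on the interval from 1 to 10 minutes. A short sketch, with an illustrative helper `p_greater`:

```python
a, b = 1, 10  # minutes; X ~ U(1, 10)


def p_greater(x):
    """P(X > x) for a uniform distribution on [a, b], with a <= x <= b."""
    return (b - x) / (b - a)


# P(X > 7)
p_more_than_7 = p_greater(7)

# Conditional probability: P(X > 6 | X > 3) = P(X > 6) / P(X > 3)
p_6_given_3 = p_greater(6) / p_greater(3)

# Mean of a uniform distribution: (a + b) / 2
mean_time = (a + b) / 2

print(round(p_more_than_7, 4), round(p_6_given_3, 4), mean_time)
```

Conditioning on X > 3 shrinks the base of the rectangle to the interval from 3 to 10, which is why the conditional probability is 4/7.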


The next three questions refer to the following situation: 


At the beginning of the quarter, the amount of time a student waits in line at the 
campus Cafeteria is normally distributed with a mean of 5 minutes and a 
standard deviation of 1.5 minutes. 

Exercise: 


Problem: What is the 90th percentile of waiting times (in minutes)? 


• A. 1.28
• B. 90
• C. 7.47
• D. 6.92


Solution: 


D: 6.92 


Exercise: 


Problem: The median waiting time (in minutes) for one student is: 


• A. 5
• B. 50
• C. 2.5
• D. 1.5


Solution:


A: 5
Exercise: 


Problem: 


Find the probability that the average wait time of 10 students is at most 5.5 
minutes. 


• A. 0.6301
• B. 0.8541
• C. 0.3694
• D. 0.1459


Solution: 


B: 0.8541 
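
The cafeteria answers can be checked with the standard-library normal distribution. In this sketch, the mean of 10 waiting times is assumed to be normal with standard deviation 1.5/√10.

```python
from math import sqrt
from statistics import NormalDist

# Individual waiting time in minutes: normal with mean 5 and standard deviation 1.5
wait = NormalDist(mu=5, sigma=1.5)

percentile_90 = wait.inv_cdf(0.90)  # value with 90% of waits below it
median_wait = wait.median           # equals the mean for a normal distribution

# Average wait of 10 students: same mean, standard deviation shrinks by sqrt(10)
avg_wait_10 = NormalDist(mu=5, sigma=1.5 / sqrt(10))
p_avg_at_most_5_5 = avg_wait_10.cdf(5.5)

print(round(percentile_90, 2), median_wait, round(p_avg_at_most_5_5, 4))
```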
Exercise: 


Problem: 


A sample of 80 software engineers in Silicon Valley is taken and it is found 
that 20% of them earn approximately $50,000 per year. A point estimate 
for the true proportion of engineers in Silicon Valley who earn $50,000 per 
year is: 


• A. 16
• B. 0.2
• C. 1
• D. 0.95


Solution: 


B: 0.2 


Exercise: 


Problem: If P(Z < z_α) = 0.1587 where Z~N(0, 1), then α is equal to:


• A. -1
• B. 0.1587
• C. 0.8413
• D. 1


Solution: 


A: -1 


Exercise: 
Problem: 
A professor tested 35 students to determine their entering skills. At the end 


of the term, after completing the course, the same test was administered to 
the same 35 students to study their improvement. This would be a test of: 


• A. independent groups
• B. 2 proportions
• C. matched pairs, dependent groups
• D. exclusive groups


Solution: 


C: matched pairs, dependent groups 
Exercise: 
Problem: 


A math exam was given to all the third grade children attending ABC 
School. Two random samples of scores were taken. 


         n    x-bar   s
Boys     55    82     5
Girls    60    86     7


Which of the following correctly describes the results of a hypothesis test 
of the claim, “There is a difference between the mean scores obtained by 
third grade girls and boys at the 5 % level of significance”? 


• A. Do not reject Ho. There is insufficient evidence to conclude that
there is a difference in the mean scores.

• B. Do not reject Ho. There is sufficient evidence to conclude that there
is a difference in the mean scores.

• C. Reject Ho. There is insufficient evidence to conclude that there is no
difference in the mean scores.

• D. Reject Ho. There is sufficient evidence to conclude that there is a
difference in the mean scores.


Solution: 


D: Reject Ho. There is sufficient evidence to conclude that there is a 
difference in the mean scores. 


Exercise: 


Problem: 


In a survey of 80 males, 45 had played an organized sport growing up. Of 
the 70 females surveyed, 25 had played an organized sport growing up. We 
are interested in whether the proportion for males is higher than the 
proportion for females. The correct conclusion is: 


• A. There is insufficient information to conclude that the proportion for
males is the same as the proportion for females.

• B. There is insufficient information to conclude that the proportion for
males is not the same as the proportion for females.

• C. There is sufficient evidence to conclude that the proportion for
males is higher than the proportion for females.

• D. Not enough information to determine.


Solution: 


C: There is sufficient evidence to conclude that the proportion for males is 
higher than the proportion for females. 
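
The conclusion for the organized-sports survey can be checked with a right-tailed, pooled two-proportion z-test. This is a sketch: the 5% significance level is an assumption (the problem does not state one), and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x_m, n_m = 45, 80  # males who played an organized sport, males surveyed
x_f, n_f = 25, 70  # females who played an organized sport, females surveyed
ALPHA = 0.05       # assumed significance level, not given in the text

p_m, p_f = x_m / n_m, x_f / n_f
p_pool = (x_m + x_f) / (n_m + n_f)  # pooled proportion under Ho: pM = pF

se = sqrt(p_pool * (1 - p_pool) * (1 / n_m + 1 / n_f))
z = (p_m - p_f) / se

# Ha is pM > pF, so the p-value is the area to the right of z
p_value = 1 - NormalDist().cdf(z)
reject_null = p_value < ALPHA

print(round(z, 3), round(p_value, 4), reject_null)
```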


Exercise: 


Problem: 


Note: Chi-Square Test of a Single Variance; Not all classes cover this topic. 
From past experience, a Statistics teacher has found that the average score 
on a midterm is 81 with a standard deviation of 5.2. This term, a class of 49 
students had a standard deviation of 5 on the midterm. Do the data indicate 
that we should reject the teacher’s claim that the standard deviation is 5.2? 
Use α = 0.05.


• A. Yes
• B. No
• C. Not enough information given to solve the problem


Solution: 


B: No 
Exercise: 


Problem: 


Note: F Distribution Test of ANOVA; Not all classes cover this topic. 
Three loading machines are being compared. Ten samples were taken for 
each machine. Machine I took an average of 31 minutes to load packages 
with a standard deviation of 2 minutes. Machine II took an average of 28 
minutes to load packages with a standard deviation of 1.5 minutes. 
Machine III took an average of 29 minutes to load packages with a 
standard deviation of 1 minute. Find the p-value when testing that the 
average loading times are the same. 


• A. the p-value is close to 0
• B. p-value is close to 1
• C. Not enough information given to solve the problem


Solution: 


B: p-value is close to 1. 


The next three questions refer to the following situation: 


A corporation has offices in different parts of the country. It has gathered the 
following information concerning the number of bathrooms and the number of 
employees at seven sites: 


Number of employees (x)    650   730   810   900   102   107   1150
Number of bathrooms (y)     40    50    34    61    82   110    121


Exercise: 


Problem: 


Is the correlation between the number of employees and the number of 
bathrooms significant? 


• A. Yes
• B. No
• C. Not enough information to answer question


Solution: 
B: No 
Exercise: 
Problem: The linear regression equation is: 


• A. y = 0.0094 - 79.96x
• B. y = 79.96 + 0.0094x
• C. y = 79.96 - 0.0094x
• D. y = -0.0094 + 79.96x


Solution:


C: y = 79.96 - 0.0094x
Exercise: 
Problem: 


If a site has 1150 employees, approximately how many bathrooms should it 
have? 


• A. 69
• B. 91
• C. 91,954
• D. We should not be estimating here.


Solution: 


D: We should not be estimating here. 
Exercise: 
Problem: 


Note: Chi-Square Test of a Single Variance; Not all classes cover this topic. 
Suppose that a sample of size 10 was collected, with x-bar = 4.4 and s = 1.4.
Ho: σ² = 1.6 vs. Ha: σ² ≠ 1.6. Which graph best describes the results of
the test?


[Answer-choice graphs a through d are not reproduced.]


Solution: 


A 
Exercise: 
Problem: 


64 backpackers were asked the number of days their latest backpacking trip 
was. The number of days is given in the table below: 


# of days    1    2    3    4    5    6    7    8
Frequency    5    9    6   12    7   10    5   10


Conduct an appropriate test to determine if the distribution is uniform. 


• A. The p-value is > 0.10. There is insufficient information to conclude
that the distribution is not uniform.

• B. The p-value is < 0.01. There is sufficient information to conclude
the distribution is not uniform.

• C. The p-value is between 0.01 and 0.10, but without alpha (α) there is
not enough information

• D. There is no such test that can be conducted.


Solution: 
A: The p-value is > 0.10. There is insufficient information to conclude 
that the distribution is not uniform. 
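
The backpacking answer can be checked with a chi-square goodness-of-fit test against a uniform distribution. This sketch reads the frequencies as 5, 9, 6, 12, 7, 10, 5, 10 (they sum to 64), and the critical value 12.017 for df = 7 at the 0.10 level is an assumed chi-square table value rather than a number from the text.

```python
observed = [5, 9, 6, 12, 7, 10, 5, 10]  # trips lasting 1, 2, ..., 8 days

n = sum(observed)
categories = len(observed)

# Under a uniform distribution every trip length is equally likely
expected = n / categories

chi2 = sum((o - expected) ** 2 / expected for o in observed)
df = categories - 1

# Chi-square critical value for df = 7 with 0.10 in the right tail (assumed from a table)
CHI2_CRIT_10 = 12.017

# A statistic below the critical value means the p-value is greater than 0.10
p_value_above_10 = chi2 < CHI2_CRIT_10

print(n, expected, chi2, df, p_value_above_10)
```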
Exercise: 
Problem: 
Note: F Distribution test of One-Way ANOVA; Not all classes cover this 


topic. Which of the following statements is true when using one-way 
ANOVA? 


• A. The populations from which the samples are selected have different
distributions.

• B. The sample sizes are large.

• C. The test is to determine if the different groups have the same means.
• D. There is a correlation between the factors of the experiment.


Solution: 


C: The test is to determine if the different groups have the same means. 


Data Sets 

This module provides data sets for use with the Collaborative Statistics 
textbook/collection. Data sets include a series of recorded motorcycle race and 
practice lap times as well as IPO stock prices. 


Lap Times 


The following tables provide lap times from Terri Vogel's Log Book. Times are 
recorded in seconds for 2.5-mile laps completed in a series of races and practice 
runs. 


          Lap 1   Lap 2   Lap 3   Lap 4   Lap 5   Lap 6   Lap 7
Race 1     135     130     131     132     130     131     133
Race 2     134     131     131     129     128     128     129
Race 3     129     128     127     127     130     127     129
Race 4     125     125     126     125     124     125     125
Race 5     133     132     132     132     131     130     132
Race 6     130     130     130     129     129     130     129
Race 7     132     131     133     131     134     134     131


132 


135 


132 


134 


128 


132 


136 


129 


134 


129 


130 


130 


131 


131 


130 


127 


131 


129 


129 


131 


129 


129 


ia, 


131 


132 


130 


128 


131 


129 


129 


132 


130 


129 


128 


132 


131 


130 


128 


131 


129 


128 


131 


130 


129 


126 


130 


130 


131 


128 


132 


129 


128 


132 


133 


129 


127 


131 


129 


130 


129 


130 


129 


129 


132 


133 


129 


124 


130 


129 


130 


128 


130 


129 


129 


132 


127 


128 


Lap Lap Lap Lap Lap Lap Lap 
1 2 3 4 5 6 7 


Race 


20 131 128 130 128 129 130 130 


Race Lap Times (in Seconds) 


              Lap 1   Lap 2   Lap 3   Lap 4   Lap 5   Lap 6   Lap 7
Practice 1     142     143     180     137     134     134     172
Practice 2     140     135     134     133     128     128     131
Practice 3     130     133     130     128     135     133     133
Practice 4     141     136     137     136     136     136     145
Practice 5     140     138     136     137     135     134     134
Practice 6     142     142     139     138     129     129     127
Practice 7     139     137     135     135     137     134     135
Practice 8     143     136     134     133     134     133     132


Practice 
9 


Practice 
10 


Practice 
11 


Practice 
12 


Practice 
13 


Practice 
14 


Practice 
15 


Practice Lap Times (in Seconds) 


Stock Prices 


131 


143 


132 


149 


133 


138 


130 


139 


133 


144 


132 


136 


128 


139 


131 


144 


137 


133 


129 


138 


129 


139 


133 


133 


127 


138 


128 


138 


134 


132 


128 


137 


127 


138 


130 


131 


127 


138 


126 


137 


131 


131 


The following table lists initial public offering (IPO) stock prices for all 1999 
stocks that at least doubled in value during the first day of trading. This is 


historical data. 


$17.00 


$23.00 


$14.00 


$16.00 


$12.00 


$26.00 


$20.00 
$18.00 
$18.00 
$16.00 
$16.00 
$17.00 
$16.00 
$8.00 

$19.00 
$13.00 
$21.00 
$17.00 
$14.00 
$15.00 
$24.00 
$14.00 
$24.00 
$16.00 
$8.00 


$21.00 


$22.00 
$21.00 
$17.00 
$10.00 
$14.00 
$16.00 
$18.00 
$20.00 
$15.00 
$14.00 
$17.00 
$19.00 
$21.00 
$23.00 
$20.00 
$19.00 
$16.00 
$15.00 
$23.00 


$34.00 


$14.00 
$21.00 
$15.00 
$20.00 
$15.00 
$15.00 
$9.00 

$17.00 
$21.00 
$15.00 
$28.00 
$18.00 
$12.00 
$14.00 
$14.00 
$16.00 
$8.00 

$7.00 

$12.00 


$16.00 


$15.00 
$19.00 
$25.00 
$12.00 
$20.00 
$15.00 
$18.00 
$14.00 
$12.00 
$14.00 
$17.00 
$17.00 
$18.00 
$16.00 
$14.00 
$38.00 
$18.00 
$19.00 
$18.00 


$26.00 


$22.00 
$15.00 
$14.00 
$16.00 
$20.00 
$19.00 
$18.00 
$11.00 
$8.00 

$13.41 
$19.00 
$15.00 
$24.00 
$12.00 
$15.00 
$20.00 
$17.00 
$12.00 
$20.00 


$14.00 


$18.00 
$21.00 
$30.00 
$17.44 
$16.00 
$48.00 
$20.00 
$16.00 
$16.00 
$28.00 


$16.00 


IPO Offer Prices 


Note: Data compiled by Jay R. Ritter of Univ. of Florida using data from
Securities Data Co. and Bloomberg.


Collaborative Statistics: Group Project - Teegarden 


Student Learning Objectives 


• The student will identify a hypothesis testing problem in print.

• The student will conduct a survey to verify or dispute the results of the
hypothesis test.

• The student will summarize the article, analysis, and conclusions in a
report.


Instructions 


As you complete each task below, check it off. Answer all questions in your 
summary. This project may be done in pairs or a group of three. Be sure to 
ensure that all the students participate equally in the work. This project is 
worth 20% of your final grade. 


• ____ Find an article in a newspaper, magazine, or on the internet
that makes a claim about ONE population mean or ONE population
proportion. The claim may be based upon a survey that the article was
reporting on. Decide whether this claim is the null or alternate
hypothesis.

• ____ Copy or print out the article and include a copy in your project,
along with the source.

• ____ State how you will collect your data. (Convenience sampling is
not acceptable.)

• ____ Conduct your survey. You must have more than 50 responses
in your sample. When you hand in your final project, attach the tally
sheet or the packet of questionnaires that you used to collect data. Your
data must be real.

• ____ State the statistics that are a result of your data collection:
sample size, sample mean, and sample standard deviation, OR sample
size and number of successes.

• ____ Make 2 copies of the appropriate solution sheet.

• ____ Record the hypothesis test on the solution sheet, based on your
experiment. Do a DRAFT solution first on one of the solution sheets
and check it over carefully. Have a classmate check your solution to
see if it is done correctly. Make your decision using a 5% level of
significance. Include the 95% confidence interval on the solution
sheet.

• ____ Create at least two different graphs to illustrate your data.
This may be a pie or bar chart or may be a histogram or box plot,
depending on the nature of your data. Produce graphs that make sense
for your data and give useful visual information about your data.
Include an analysis of the graphs in your summary.

• ____ Write your summary (in complete sentences and paragraphs,
with proper grammar and correct spelling) that describes the project.
The summary MUST include:

o 1. Brief discussion of the article, including the source.

o 2. Statement of the claim made in the article (one of the
hypotheses).

o 3. Detailed description of how, where, and when you collected the
data, including the sampling technique. Did you use cluster,
stratified, systematic, or simple random sampling (using a random
number generator)? As stated above, convenience sampling is not
acceptable.

o 4. Discuss the shape of your data and the relevant information
obtained from your graphs.

o 5. Conclusion about the article claim in light of your hypothesis
test. This is the conclusion of your hypothesis test, stated in
words, in the context of the situation in your project in sentence
form, as if you were writing this conclusion for a non-statistician.

o 6. Sentence interpreting your confidence interval in the context of
the situation in your project.


Assignment Checklist 


Turn in the following typed (12 point) and stapled packet for your final 
project: 


• ____ Cover sheet containing your name(s), class time, and the name of
your study.

• ____ Summary, which includes all items listed on the summary checklist.

• ____ Solution sheet neatly and completely filled out. The solution
sheet does not need to be typed.

• ____ Graphic representation of your data, created following the
guidelines discussed above. Include only graphs which are appropriate
and useful.

• ____ Raw data collected AND a table summarizing the sample data
(n, xbar and s; or x, n, and p’, as appropriate for your hypotheses). The
raw data does not need to be typed, but the summary does. Hand in the
data as you collected it. (Either attach your tally sheet or an envelope
containing your questionnaires.)


Solution Sheet: Hypothesis Testing for Single Mean and Single Proportion 
This module provides a solution sheet for the Hypothesis Testing: Single 
Mean and Single Proportion chapter of the Collaborative Statistics 
textbook/collection. 


Class Time: 
Name: 


a. H_0:

b. H_a:

c. In words, CLEARLY state what your random variable X̄ or P′
represents.

d. State the distribution to use for the test.

e. What is the test statistic?

f. What is the p-value? In 1 to 2 complete sentences, explain what the
p-value means for this problem.

g. Use the previous information to sketch a picture of this situation.
CLEARLY label and scale the horizontal axis and shade the region(s)
corresponding to the p-value.

h. Indicate the correct decision (“reject” or “do not reject” the null
hypothesis), the reason for it, and write an appropriate conclusion,
using complete sentences.

   i. Alpha:
   ii. Decision:
   iii. Reason for decision:
   iv. Conclusion:

i. Construct a 95% Confidence Interval for the true mean or proportion.
Include a sketch of the graph of the situation. Label the point estimate
and the lower and upper bounds of the Confidence Interval.
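
As a rough illustration of how items d through i fit together, the sketch below runs a two-tailed test for a single proportion using the normal approximation. The function names, the choice of a two-tailed alternative, and the data (60 successes in 100 trials against a null value of 0.5) are assumptions made up for this example; they do not come from the textbook.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_proportion_test(x, n, p0, alpha=0.05):
    """Two-tailed z-test of H_0: p = p0 against H_a: p != p0 (hypothetical helper)."""
    p_prime = x / n                              # sample proportion p'
    se_null = math.sqrt(p0 * (1 - p0) / n)       # standard error assuming H_0 is true
    z = (p_prime - p0) / se_null                 # item e: test statistic
    p_value = 2 * (1 - normal_cdf(abs(z)))       # item f: area in both tails
    decision = "reject" if p_value < alpha else "do not reject"   # item h
    # item i: 95% confidence interval, point estimate +/- error bound (EBP);
    # 1.96 is the standard normal critical value for a 95% confidence level
    ebp = 1.96 * math.sqrt(p_prime * (1 - p_prime) / n)
    return z, p_value, decision, (p_prime - ebp, p_prime + ebp)

# Made-up data for illustration only
z, p_value, decision, ci = one_proportion_test(60, 100, 0.5)
print(z, p_value, decision, ci)
```

The decision compares the p-value with alpha, which is the "reason for decision" the sheet asks for. The written conclusion still has to be stated in the context of the problem.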


Solution Sheet: Hypothesis Testing for Two Means, Paired Data, and Two 
Proportions 

This module provides a solution sheet for the Hypothesis Testing: Two 
Means, Paired Data, Two Proportions chapter of the Collaborative Statistics 
textbook/collection. 


Class Time: 
Name: 


a. H_0:

b. H_a:

c. In words, clearly state what your random variable X̄_1 − X̄_2,
P′_1 − P′_2, or X̄_d represents.

d. State the distribution to use for the test.

e. What is the test statistic?

f. What is the p-value? In 1 to 2 complete sentences, explain what the
p-value means for this problem.

g. Use the previous information to sketch a picture of this situation.
CLEARLY label and scale the horizontal axis and shade the region(s)
corresponding to the p-value.

h. Indicate the correct decision (“reject” or “do not reject” the null
hypothesis), the reason for it, and write an appropriate conclusion,
using complete sentences.

   i. Alpha:
   ii. Decision:
   iii. Reason for decision:
   iv. Conclusion:

i. In complete sentences, explain how you determined which
distribution to use.
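
For the two-proportion case, the test statistic and p-value might be computed as in the following sketch. It assumes a two-tailed test with a pooled proportion under the null hypothesis; the function name and the counts (50 of 100 versus 30 of 100) are hypothetical and serve only to show the arithmetic.

```python
import math

def two_proportion_test(x_a, n_a, x_b, n_b, alpha=0.05):
    """Two-tailed z-test of H_0: p_A = p_B using a pooled proportion (hypothetical helper)."""
    p_a, p_b = x_a / n_a, x_b / n_b              # the two sample proportions
    p_pooled = (x_a + x_b) / (n_a + n_b)         # pooled estimate, valid when H_0 is assumed true
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se                         # test statistic
    tail = 1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))   # one-tail normal area
    p_value = 2 * tail
    decision = "reject" if p_value < alpha else "do not reject"
    return z, p_value, decision

# Made-up counts for illustration only
z2, p_value2, decision2 = two_proportion_test(50, 100, 30, 100)
print(z2, p_value2, decision2)
```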


Solution Sheet: The Chi-Square Distribution 
This module provides a solution sheet for the Chi-Square Distribution 
chapter of the Collaborative Statistics textbook/collection. 


Class Time: 
Name: 


a. H_0:

b. H_a:

c. What are the degrees of freedom?

d. State the distribution to use for the test.

e. What is the test statistic?

f. What is the p-value? In 1 to 2 complete sentences, explain what the
p-value means for this problem.

g. Use the previous information to sketch a picture of this situation.
Clearly label and scale the horizontal axis and shade the region(s)
corresponding to the p-value.

h. Indicate the correct decision (“reject” or “do not reject” the null
hypothesis) and write appropriate conclusions, using complete
sentences.

   i. Alpha:
   ii. Decision:
   iii. Reason for decision:
   iv. Conclusion:
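
Items c and e of this sheet can be sketched for a goodness-of-fit test as follows. The observed and expected counts are invented for the example, and the function name is hypothetical. The p-value is not computed here because the Python standard library has no chi-square distribution function; a table or a statistics package would supply it.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up counts for four categories, each expected 25 times under H_0
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1    # goodness-of-fit: number of categories minus 1
print(statistic, degrees_of_freedom)
```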


English Phrases Written Mathematically 
This module provides an overview of commonly used phrases in statistics 
and their mathematical equivalents. 


English Phrases Written Mathematically 


When the English says:              Interpret this as:
X is at least 4.                    X ≥ 4
The minimum of X is 4.              X ≥ 4
X is no less than 4.                X ≥ 4
X is greater than or equal to 4.    X ≥ 4
X is at most 4.                     X ≤ 4
The maximum of X is 4.              X ≤ 4
X is no more than 4.                X ≤ 4
X is less than or equal to 4.       X ≤ 4
X does not exceed 4.                X ≤ 4
X is greater than 4.                X > 4
X is more than 4.                   X > 4
X exceeds 4.                        X > 4
X is less than 4.                   X < 4
There are fewer X than 4.           X < 4
X is 4.                             X = 4
X is equal to 4.                    X = 4
X is the same as 4.                 X = 4
X is not 4.                         X ≠ 4
X is not equal to 4.                X ≠ 4
X is not the same as 4.             X ≠ 4
X is different than 4.              X ≠ 4
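
One way to check a translation from the table is to encode a few phrases as comparisons and test the boundary value. The sketch below does this; the dictionary PHRASE_TO_COMPARISON and the helper satisfies are hypothetical names chosen for this illustration.

```python
import operator

# Hypothetical lookup: each English phrase maps to (symbol, comparison function).
PHRASE_TO_COMPARISON = {
    "is at least": ("≥", operator.ge),
    "is no less than": ("≥", operator.ge),
    "is at most": ("≤", operator.le),
    "does not exceed": ("≤", operator.le),
    "is more than": (">", operator.gt),
    "exceeds": (">", operator.gt),
    "is less than": ("<", operator.lt),
    "is the same as": ("=", operator.eq),
    "is different than": ("≠", operator.ne),
}

def satisfies(x, phrase, bound):
    """Return True if the value x meets the condition 'X <phrase> bound'."""
    _symbol, compare = PHRASE_TO_COMPARISON[phrase]
    return compare(x, bound)

print(satisfies(4, "is at least", 4))     # True: the boundary value counts
print(satisfies(4, "is more than", 4))    # False: the boundary value is excluded
```

The boundary case is where most translation mistakes occur: "at least" and "at most" include the value 4, while "more than" and "less than" exclude it.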


Symbols and their Meanings 
This module defines symbols used throughout the Collaborative Statistics 
textbook. 


Chapter (1st used)                 Symbol          Spoken                                  Meaning
Sampling and Data                  √               The square root of                      same
Sampling and Data                  π               Pi                                      3.14159... (a specific number)
Descriptive Statistics             Q1              Quartile one                            the first quartile
Descriptive Statistics             Q2              Quartile two                            the second quartile
Descriptive Statistics             Q3              Quartile three                          the third quartile
Descriptive Statistics             IQR             inter-quartile range                    Q3 − Q1 = IQR
Descriptive Statistics             x̄               x-bar                                   sample mean
Descriptive Statistics             μ               mu                                      population mean
Descriptive Statistics             s, s_x, sx      s                                       sample standard deviation
Descriptive Statistics             s²              s-squared                               sample variance
Descriptive Statistics             σ, σ_x, σx      sigma                                   population standard deviation
Descriptive Statistics             σ²              sigma-squared                           population variance
Descriptive Statistics             Σ               capital sigma                           sum
Probability Topics                 {}              brackets                                set notation
Probability Topics                 S               S                                       sample space
Probability Topics                 A               Event A                                 event A
Probability Topics                 P(A)            probability of A                        probability of A occurring
Probability Topics                 P(A|B)          probability of A given B                prob. of A occurring given B has occurred
Probability Topics                 P(A or B)       prob. of A or B                         prob. of A or B or both occurring
Probability Topics                 P(A and B)      prob. of A and B                        prob. of both A and B occurring (same time)
Probability Topics                 A′              A-prime, complement of A                complement of A, not A
Probability Topics                 P(A′)           prob. of complement of A                same
Probability Topics                 G1              green on first pick                     same
Probability Topics                 P(G1)           prob. of green on first pick            same
Discrete Random Variables          PDF             prob. distribution function             same
Discrete Random Variables          X               X                                       the random variable X
Discrete Random Variables          X ~             the distribution of X                   same
Discrete Random Variables          B               binomial distribution                   same
Discrete Random Variables          G               geometric distribution                  same
Discrete Random Variables          H               hypergeometric dist.                    same
Discrete Random Variables          P               Poisson dist.                           same
Discrete Random Variables          λ               Lambda                                  average of Poisson distribution
Discrete Random Variables          ≥               greater than or equal to                same
Discrete Random Variables          ≤               less than or equal to                   same
Discrete Random Variables          =               equal to                                same
Discrete Random Variables          ≠               not equal to                            same
Continuous Random Variables        f(x)            f of x                                  function of x
Continuous Random Variables        pdf             prob. density function                  same
Continuous Random Variables        U               uniform distribution                    same
Continuous Random Variables        Exp             exponential distribution                same
Continuous Random Variables        k               k                                       critical value
Continuous Random Variables        f(x) =          f of x equals                           same
Continuous Random Variables        m               m                                       decay rate (for exp. dist.)
The Normal Distribution            N               normal distribution                     same
The Normal Distribution            z               z-score                                 same
The Normal Distribution            Z               standard normal dist.                   same
The Central Limit Theorem          CLT             Central Limit Theorem                   same
The Central Limit Theorem          X̄               X-bar                                   the random variable X-bar
The Central Limit Theorem          μ_x             mean of X                               the average of X
The Central Limit Theorem          μ_x̄             mean of X-bar                           the average of X-bar
The Central Limit Theorem          σ_x             standard deviation of X                 same
The Central Limit Theorem          σ_x̄             standard deviation of X-bar             same
The Central Limit Theorem          ΣX              sum of X                                same
The Central Limit Theorem          Σx              sum of x                                same
Confidence Intervals               CL              confidence level                        same
Confidence Intervals               CI              confidence interval                     same
Confidence Intervals               EBM             error bound for a mean                  same
Confidence Intervals               EBP             error bound for a proportion            same
Confidence Intervals               t               student-t distribution                  same
Confidence Intervals               df              degrees of freedom                      same
Confidence Intervals               t_(α/2)         student-t with α/2 area in right tail   same
Confidence Intervals               p′, p̂           p-prime; p-hat                          sample proportion of success
Confidence Intervals               q′, q̂           q-prime; q-hat                          sample proportion of failure
Hypothesis Testing                 H_0             H-naught, H-sub 0                       null hypothesis
Hypothesis Testing                 H_a             H-a, H-sub a                            alternate hypothesis
Hypothesis Testing                 H_1             H-1, H-sub 1                            alternate hypothesis
Hypothesis Testing                 α               alpha                                   probability of Type I error
Hypothesis Testing                 β               beta                                    probability of Type II error
Hypothesis Testing                 X̄_1 − X̄_2       X1-bar minus X2-bar                     difference in sample means
Hypothesis Testing                 μ_1 − μ_2       mu-1 minus mu-2                         difference in population means
Hypothesis Testing                 P′_1 − P′_2     P1-prime minus P2-prime                 difference in sample proportions
Hypothesis Testing                 p_1 − p_2       p1 minus p2                             difference in population proportions
Chi-Square Distribution            χ²              Ky-square                               Chi-square
Chi-Square Distribution            O               Observed                                Observed frequency
Chi-Square Distribution            E               Expected                                Expected frequency
Linear Regression and Correlation  y = a + bx      y equals a plus b-x                     equation of a line
Linear Regression and Correlation  ŷ               y-hat                                   estimated value of y
Linear Regression and Correlation  r               correlation coefficient                 same
Linear Regression and Correlation  ε               error                                   same
Linear Regression and Correlation  SSE             Sum of Squared Errors                   same
Linear Regression and Correlation  1.9s            1.9 times s                             cut-off value for outliers
F-Distribution and ANOVA           F               F-ratio                                 F ratio


Formulas 

This module provides an overview of the statistics formulas used in the
Collaborative Statistics collection (col10522) by Barbara Illowsky and
Susan Dean.

Formula 

Factorial 


\[ n! = n(n-1)(n-2)\cdots(1) \]

\[ 0! = 1 \]
Formula 
Combinations 


\[ \binom{n}{r} = \frac{n!}{(n-r)!\,r!} \]
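
The factorial and combinations formulas can be checked with a few lines of Python. This is only a sketch: the function names are chosen for the example, and the standard library already offers math.factorial and math.comb for real use.

```python
import math

def factorial(n):
    """n! = n(n-1)(n-2)...(1), with 0! = 1."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

def combinations(n, r):
    """Number of ways to choose r items from n: n! / ((n-r)! r!)."""
    return factorial(n) // (factorial(n - r) * factorial(r))

print(factorial(0), factorial(5))    # 1 120
print(combinations(5, 2))            # 10
```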


Formula 
Binomial Distribution 


\[ X \sim B(n, p) \]

\[ P(X = x) = \binom{n}{x} p^{x} q^{n-x}, \quad \text{for } x = 0, 1, 2, \ldots, n \]
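
A minimal sketch of the binomial formula follows, assuming the usual convention q = 1 − p. The function name binomial_pmf and the example values n = 4, p = 0.5 are chosen here for illustration.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): C(n, x) * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return math.comb(n, x) * p ** x * q ** (n - x)

# Distribution of the number of heads in 4 fair coin tosses
probabilities = [binomial_pmf(x, 4, 0.5) for x in range(5)]
print(probabilities)         # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(probabilities))    # 1.0
```

Summing the probabilities over x = 0, 1, ..., n is a quick sanity check that the formula was entered correctly.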
Formula 
Geometric Distribution 


\[ X \sim G(p) \]

\[ P(X = x) = q^{x-1} p, \quad \text{for } x = 1, 2, 3, \ldots \]
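
The geometric formula can be sketched the same way, again assuming q = 1 − p. Here X counts the trial on which the first success occurs; the function name is hypothetical.

```python
def geometric_pmf(x, p):
    """P(X = x) for X ~ G(p): (x - 1) failures followed by one success."""
    q = 1 - p
    return q ** (x - 1) * p

print(geometric_pmf(3, 0.5))    # 0.125: two failures, then a success

# The probabilities over x = 1, 2, 3, ... approach a total of 1
partial_sum = sum(geometric_pmf(x, 0.5) for x in range(1, 60))
print(partial_sum)
```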
Formula 
Hypergeometric Distribution 


\[ X \sim H(r, b, n) \]

\[ P(X = x) = \frac{\binom{r}{x}\binom{b}{n-x}}{\binom{r+b}{n}} \]
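
The sketch below evaluates the hypergeometric formula. It assumes the usual reading of H(r, b, n): r items in the group of interest, b items in the other group, and a sample of n drawn without replacement. The function name and the example numbers are made up for illustration.

```python
import math

def hypergeometric_pmf(x, r, b, n):
    """P(X = x): choose x of the r items of interest and n - x of the b others."""
    return math.comb(r, x) * math.comb(b, n - x) / math.comb(r + b, n)

# Made-up example: 6 items of interest, 5 others, sample of 4
probability_two = hypergeometric_pmf(2, 6, 5, 4)
total = sum(hypergeometric_pmf(x, 6, 5, 4) for x in range(5))
print(probability_two, total)
```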
Formula 


Poisson Distribution 


\[ X \sim P(\mu) \]

\[ P(X = x) = \frac{\mu^{x} e^{-\mu}}{x!} \]
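
A short sketch of the Poisson formula, with a hypothetical function name and a mean of 2 picked only for the example:

```python
import math

def poisson_pmf(x, mu):
    """P(X = x) for X ~ P(mu): mu^x * e^(-mu) / x!."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

probability_zero = poisson_pmf(0, 2)     # equals e^(-2)
poisson_total = sum(poisson_pmf(x, 2) for x in range(50))
print(probability_zero, poisson_total)
```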
Formula 
Uniform Distribution 


\[ X \sim U(a, b) \]

\[ f(x) = \frac{1}{b-a}, \quad a < x < b \]
Formula 


Exponential Distribution 
\[ X \sim Exp(m) \]

\[ f(x) = m e^{-mx}, \quad m > 0, \; x \ge 0 \]
Formula 
Normal Distribution 


\[ X \sim N(\mu, \sigma^{2}) \]

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}, \quad -\infty < x < \infty \]
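
The normal density can be evaluated directly from the formula, as in this sketch. The function name is chosen for the example, and the rough numerical integration at the end is only a sanity check that the total area is close to 1.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    return coefficient * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

peak = normal_pdf(0, 0, 1)    # height of the standard normal curve at its mean

# Crude Riemann sum over [-8, 8] with step 0.001 for the standard normal
step = 0.001
area = sum(normal_pdf(-8 + i * step, 0, 1) * step for i in range(16000))
print(peak, area)
```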


Formula 
Gamma Function 


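\[ \Gamma(z) = \int_{0}^{\infty} x^{z-1} e^{-x}\,dx, \quad z > 0 \]

\[ \Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi} \]

\[ \Gamma(m+1) = m! \quad \text{for } m \text{, a nonnegative integer} \]

\[ \text{otherwise: } \Gamma(a+1) = a\,\Gamma(a) \]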
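
The three gamma function identities can be spot-checked numerically with the standard library function math.gamma. The variable names and the test value a = 2.5 are chosen for this illustration.

```python
import math

# Gamma(1/2) equals the square root of pi
half_identity = math.isclose(math.gamma(0.5), math.sqrt(math.pi))

# Gamma(m + 1) equals m! for nonnegative integers m
factorial_identity = all(
    math.isclose(math.gamma(m + 1), math.factorial(m)) for m in range(10)
)

# Recurrence Gamma(a + 1) = a * Gamma(a) for a non-integer value
a = 2.5
recurrence_identity = math.isclose(math.gamma(a + 1), a * math.gamma(a))

print(half_identity, factorial_identity, recurrence_identity)
```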

Formula 


Student-t Distribution 


\[ X \sim t_{df} \]

\[ X = \frac{Z}{\sqrt{Y/n}} \]

\[ Z \sim N(0, 1), \quad Y \sim \chi^{2}_{df}, \quad n = \text{degrees of freedom} \]
Formula 
Chi-Square Distribution 


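\[ X \sim \chi^{2}_{df} \]

\[ f(x) = \frac{x^{\frac{n-2}{2}}\, e^{-\frac{x}{2}}}{2^{\frac{n}{2}}\, \Gamma\!\left(\frac{n}{2}\right)}, \quad x > 0, \; n = \text{positive integer and degrees of freedom} \]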
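
As a check on the chi-square density, the sketch below evaluates it with math.gamma. The function name is hypothetical. With n = 2 degrees of freedom the formula reduces to e^(−x/2)/2, which gives an easy value to compare against.

```python
import math

def chi_square_pdf(x, n):
    """Chi-square density with n degrees of freedom, for x > 0."""
    numerator = x ** ((n - 2) / 2) * math.exp(-x / 2)
    denominator = 2 ** (n / 2) * math.gamma(n / 2)
    return numerator / denominator

density_at_two = chi_square_pdf(2, 2)    # should match e^(-1) / 2
print(density_at_two, math.exp(-1) / 2)
```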


Formula 
F Distribution 


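\[ X \sim F_{df(n),\,df(d)} \]

df(n) = degrees of freedom for the numerator

df(d) = degrees of freedom for the denominator

\[ X = \frac{Y / df(n)}{W / df(d)}, \quad Y \text{ and } W \text{ are independent chi-square random variables with } df(n) \text{ and } df(d) \text{ degrees of freedom} \]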


